Abstract

The purpose of the thesis is to present a series of models of
digital computers at the level of the memory processor
interface. A discussion of computer instructions is presented
and the single address format is taken as the prototype
instruction. The execution rate for instructions of this type
is then determined for several computer structures of the
single processor and general multiprocessor types. The effect
on the execution rate of a specialized processing activity,
input/output handling, is considered. Analytic models relate
the instruction execution rate to the memory and processor
speeds, their number, and their interconnection. Simulation
studies serve to verify the results of the analysis. A simple
automatic design program is proposed which optimally
configures computer structures from a set of available
components.

Chapter I Introduction


A.  Computer  Modelling 

The purpose of this thesis is to present a series of analytic
models of digital computers. The models relate the performance
of the computer, as measured by the rate of instruction
execution, to the specifications of its major high level
components, their number, and their interconnection. The main
components considered are memories, characterized by their
cycle and access times, and processors, characterized by the
times required to perform each of their operations. (A
detailed consideration of the computer components is given in
chapter II.)

There are two reasons for doing the modelling. The first is to
gain a quantitative understanding of those factors which
govern the performance of digital computers - analysis. The
second is to assist in the design of digital computers -
synthesis.

A review of the computer literature indicates that computer
modelling at the level of the memory processor interface has
been neglected.¹ The probable reason for the neglect is the
mathematical difficulties associated with analytic solutions
of suitable models. A major contribution of the thesis is the
development of reasonably simple approximate solution methods
for the models proposed. As is indicated in chapter V,
simulation studies suggest that the approximate solutions are
quite satisfactory.

¹ There has been some analysis; these earlier results are
discussed in the relevant chapters of the thesis.

B.  Computer  Analysis 

The digital computer is an information processing device and
an appropriate measure of its performance is the rate at which
the information is processed. The primitive computer activity
(as opposed to computer component activity) is the execution
of an instruction. If we assume certain things constant over a
class of computer structures to be analyzed - specifically,
the instruction set and the memory word size - then the
performance of the computers can be taken as equal to the IER,
where the IER is the instruction execution rate. The IER
mainly depends on two factors: component speed and
concurrency. The execution of an instruction (a discussion of
computer instructions appears in chapter II) involves a
sequence of operations by a memory and a processor. The IER is
determined not only by how fast these operations are carried
out but also by the number of operations being carried out
simultaneously. (As a practical matter the subject of
concurrency is a rather important one. Technology imposes, at
any given time, limits on how fast basic operations can take
place, and the only remaining factor that can be used to
increase performance is concurrency. Contemporary large
computers such as the CDC 6600 [Thornton, 1970] and IBM 360/91
[Anderson, et al., 1967] use concurrency in several parts of
the computer.)

Since the operations comprising a single instruction are
normally intended to be carried out sequentially, the presence
of concurrent operations implies that for at least some of the
time more than one instruction is in the process of being
executed. The multiple instructions usually appear in either
of two ways: in a multiprocessor computer (with multiple
instruction streams being simultaneously executed) or in a
single processor computer simultaneously executing successive
instructions of a single instruction stream.

The general class of computer structures is indicated
diagrammatically in figure 1. A group of memories (each
indicated by an M) is connected through a switch (S) to a
group of processors (P). The memories are also connected
through the switch to a specialized processor, an input/output
channel (i/o), which is characterized by a constant memory
access rate. In the general case each of the memories and each
of the processors can be different and the extent to which any
given memory is used by any given processor can be specified
independently. For the analysis in the thesis we consider
several special structures of this general class. We assume
that all the memories and all the processors are alike and we
assume that there is an equal likelihood that any given memory
is used by any given processor. We divide the structures into
two classes, single processors (a single P on figure 1) and
multiprocessors, and provide a different type of analysis for
each. For the single processors (analyzed in chapter III) we
introduce a notation called an instruction timing diagram and
with the aid of it directly compute the instruction execution
time and then the IER. The basic approach taken is to add the
average delay in accessing memory to the processor time to get
the total instruction execution time. This approach, while
both straightforward and conveniently used to examine
processor features in detail, is very awkward to apply to
multiprocessor computers, and another approach is indicated.
For multiprocessors (analyzed without i/o in chapter IV and
with i/o in chapter VI) we introduce a special instruction
form called a unit instruction which allows us to determine
the IER directly in terms of the rate of memory cycle
utilization. The utilization is determined by an approach
related to the occupancy problem of combinatorial analysis.
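
As a rough illustration of the kind of occupancy calculation
alluded to above, the following sketch assumes n simultaneous
references, each directed independently and with equal
likelihood to one of m memories, and computes the expected
number of distinct memories referenced. The thesis's own
multiprocessor model is developed in chapter IV and is not
reproduced here; the function names and the independence
assumption are illustrative choices only.

```python
from itertools import product


def expected_busy_memories(m: int, n: int) -> float:
    """Expected number of distinct memories hit by n uniform random references.

    Classical occupancy result: a given memory is missed by all n
    references with probability (1 - 1/m)**n, so it is hit with the
    complementary probability; summing over m memories gives the mean.
    """
    return m * (1.0 - (1.0 - 1.0 / m) ** n)


def enumerated_busy_memories(m: int, n: int) -> float:
    """Same quantity by brute-force enumeration of all m**n equally likely outcomes."""
    outcomes = list(product(range(m), repeat=n))
    return sum(len(set(outcome)) for outcome in outcomes) / len(outcomes)
```

The closed form and the enumeration should agree for small m
and n, which gives a quick check on the formula before it is
used with larger values.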

C.  Computer  Synthesis 

The models developed in the thesis relate the performance of
the computer to the number, specifications, and
interconnection of its components. These variables are also
those to which the cost of the computer is related. If both
cost and performance are related to common variables, it is
possible to formulate a design procedure that will choose
computer configurations that are optimum with respect to
certain design criteria.

Synthesis procedures are usually iterative as indicated in
figure 2. A series of potential design configurations are
generated, analyzed, and tested against the design criteria.
The best of the configurations (as measured against the design
criteria) is chosen. The heart of the synthesis procedure is
the analysis part. The generation of the proposed
configurations can be either simple, in that all
configurations (in some prespecified design space) are
generated, or more complex, in that the configurations
generated are dependent on the results of analysis and testing
of earlier configurations.

The  purpose  of  this  thesis  is  not  to  consider 
design  procedures  in  detail.  However,  to  illustrate  the 
utility  of  the  thesis  analysis  in  design,  a  simple  design 
program  is  presented  in  chapter  VII.  The  program  chooses  a 
configuration  (in  other  words,  it  picks  the  number  of 
memories  and  processors  and  their  speeds)  so  as  to  realize 
a  desired  IER  at  a  minimum  cost. 


[Figure 2 flowchart: Generate a configuration -> Analyze the
configuration -> Test the results of the analysis against the
design criteria -> Exit with the best configuration]

Figure 2. Synthesis Procedure
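
The simple (exhaustive) form of the synthesis procedure can be
sketched as follows. This is only a minimal sketch under stated
assumptions: the configuration is taken to be a pair (m, t_c),
and toy_ier and toy_cost are hypothetical stand-ins for the
analysis and cost relations; they are not the models of the
thesis nor the design program of chapter VII.

```python
from itertools import product


def synthesize(configs, analyze, cost, target_ier):
    """Return the cheapest configuration whose analyzed IER meets the target.

    Returns None when no generated configuration satisfies the criterion.
    """
    best, best_cost = None, float("inf")
    for config in configs:                 # generate a configuration
        ier = analyze(config)              # analyze the configuration
        if ier < target_ier:               # test against the design criteria
            continue
        c = cost(config)
        if c < best_cost:
            best, best_cost = config, c
    return best                            # exit with the best configuration


def toy_ier(config):
    """Hypothetical stand-in for the analysis (instructions per usec)."""
    m, t_c = config
    return (2.0 - 1.0 / m) / (2.0 * t_c)


def toy_cost(config):
    """Hypothetical cost: m memories, each costing 1/t_c."""
    m, t_c = config
    return m / t_c


# Prespecified design space: (number of memories m, cycle time t_c in usec).
design_space = list(product([1, 2, 4, 8], [0.5, 1.0, 2.0]))
best = synthesize(design_space, toy_ier, toy_cost, target_ier=0.8)
```

Passing the analysis and cost relations in as functions keeps
the generate, analyze, and test steps separate, so a more
detailed analysis could be substituted without changing the
loop.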
Chapter II Computer Components and Instructions
A.  Memories 

The structure and organization of the digital computer is
influenced by both what we think the computer should be and by
the technology of the computer components. Probably memory
technology has had more influence, historically, on computer
design than any other factor.

The purpose of the memory is to hold programs (sets of
instructions) and the data (sets of operands) to be processed.
The time required to obtain information from memory thus
strongly influences (limits) the instruction execution rate.
For economic reasons it is generally impossible to provide, in
a single memory, both sufficient speed of operation to realize
an adequate IER and still have adequate memory size (number of
memory words) to hold all the programs and data associated
with a computer system. At times even a single program and its
related data may face this limitation. Consequently, it is
conventional that at least two forms of memory be present in a
computer system: primary and secondary.

The primary memory is a relatively small but fast storage area
for instructions and operands that can be directly operated on
by the processor. The information stored in the secondary
memory cannot be acted on directly by the processor; it must
first be transferred to primary memory. The transfer of
information between primary and secondary memories is an
important activity in digital computer systems, both because
it interferes with processor access to primary memory and
because there is usually a large access time associated with
secondary memories.

Because of the general difficulty (or impossibility) of
predicting the processor accessing pattern, an important
requirement for primary memories is a random access
characteristic. For such memories, the time required to access
any word of information is independent of its location in
memory; in particular it is independent of the relation
between the location referenced and the last referenced
location.


At the time of writing the most common form of primary memory
is the magnetic core type. Magnetic core memories typically
used in computers have word sizes from eight to over 100 bits
(possibly 500 bits) and have total word capacities from about
4K (K = 1024) to 64K. (The sizes given are typical for single
memory units. The entire computer primary memory may be made
up from a number of memory units.) Core memories have complete
operation or cycle times in the range of 0.5 μsec. to 10 μsec.
The readout of information in a magnetic core is inherently a
destructive process; that is, the contents of the memory
location are lost when read. Since this is usually undesired,
the information must also be restored in the memory after
being read. This re-writing takes an additional amount of
time. Once the information is read out, however, it is
immediately available to a processor; the latter need not wait
for the restore time. The time elapsing between the initiation
of a memory request and the time the information becomes
available is the access time; it is typically 30% to 50% of
the cycle time.

The cost of magnetic core memories is generally related to the
cycle time and a very rough approximation would give the cost
proportional to the reciprocal of the cycle time. Core
memories are frequently of the coincident current type and in
these the cost of the electronic part of the memory is roughly
proportional to the square root of the memory size. The
balance of the memory cost is directly proportional to the
memory size. Thus, the cost per word is lower in large
memories than in small; in particular a memory of w words
costs less than m memories of w/m words each.
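
The size dependence just described can be made concrete with a
small sketch. It assumes a cost of the form a*sqrt(w) + b*w for
a memory of w words at a fixed cycle time; the coefficients a
and b are hypothetical values chosen only for illustration, and
the dependence on cycle time is left out.

```python
import math


def core_memory_cost(words: int, a: float = 1.0, b: float = 0.01) -> float:
    """Cost of one memory unit at a fixed cycle time.

    The electronics term grows as sqrt(words) and the remainder grows
    linearly with words; a and b are hypothetical cost coefficients.
    """
    return a * math.sqrt(words) + b * words


def cost_per_word(words: int) -> float:
    """Cost per word of a single memory unit of the given size."""
    return core_memory_cost(words) / words


single = core_memory_cost(65536)            # one 64K-word memory
split = 4 * core_memory_cost(65536 // 4)    # four 16K-word memories
```

With these coefficients the linear terms are identical in both
cases, so the difference between `single` and `split` comes
entirely from the square-root (electronics) term, which is
paid once per memory unit.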

Another form of primary memory is the transistor register
type. These memories are characterized by very fast access
times of about 23 to 100 nsec. Their cost, though, is such as
to preclude their general use as primary memory. It is common,
however, for contemporary computers to provide a small amount
(typically 8 to 64 words) of primary memory in registers. The
registers are often addressable as if they were locations in
core memory.

The secondary memories store the large bulk of information in
the digital computer system. The principal types of secondary
memories are rotating discs and drums and linear magnetic
tapes. The cost of storage per bit in secondary memories is
about 1% to 10% of that in primary memory. The low cost and
large capacity of secondary memories are due to the fact that
they are not of the random access type. The access time is
dependent on the relation between the last and currently
accessed data and the time elapsed since that last access. The
average access time for randomly located information is half
the time for a revolution (about 10 msec.) in rotating
memories and the time to search half the tape (a number of
seconds) in magnetic tape memories. The maximum rate of
information flow in a random access memory is the reciprocal
of the cycle time; in a non-random access memory the maximum
is obviously not the reciprocal of the average access time.
For the type of highly structured information flows that take
place between the primary and secondary memories, the word
flow rate (particularly from drums) may approach that
obtainable from primary memory.

B. Processors

The purpose of the processor is to obtain instructions and
operands from memory, decode the instructions, and perform the
required operation on the operands. Often the speed at which
arithmetic operations are carried out strongly influences the
IER, and it is fairly conventional to characterise processors
by their arithmetic capabilities. There are two principal
types of arithmetic operands: fixed point (integer) and
floating point (fraction plus exponent). Small to medium size
computers usually have only fixed point arithmetic operations
built in; floating point operations are programmed. The
smaller computers may have only fixed point add and subtract,
and even fixed point multiply and divide must be programmed.
For the purpose of discussing processor times, instruction
decoding and other types of operations implemented such as
logical and control operations can be grouped with the fixed
point add and subtract instructions. At the present time these
types of operations generally require on the order of a few
tens of nsec. to a few hundred nsec. A fixed point multiply or
divide takes from a few hundred nsec. to several μsec.
depending on the processor. Floating point operations, when
implemented, have execution times in the range of about 0.4 to
15 μsec. The important relations in determining IER are, as we
shall see later, between the processor times and the memory
restore times. From the preceding discussion we surmise that
for most contemporary computers the basic fixed point
operations require less time to execute than the memory
restore time. The balance of the other operations may,
depending on the particular memory and processor, have
operation times greater or less than the memory restore time.


C.  The  Computer  Instruction 

We shall take as our basic computer instruction the
implementation of a binary (two operand) operation. A general
binary operation can be represented as:

    result ← operand 1 [operator] operand 2.

In a computer we refer to information by its location in
memory (specified by a memory address) and consequently the
prototype instruction is written:

    1. (A) ← (B) [operator] (C)

    2. Take (D) as the next instruction.

The notation (X) means the contents of the memory location
specified by address X. Thus B and C are the operand
addresses; A is the result address. The second part of the
instruction is necessary because instructions are executed as
part of a sequence; the location of the next instruction of
the sequence is specified by address D.

As we can see there are four memory addresses (A, B, C, D)
associated with the prototype instruction. These addresses
must be specified somehow, but they need not be explicitly
included in the instruction; they can be specified in an
implicit manner. The reasons for preferring implicit
specification of addresses are: (1) a potential improvement in
the IER by reducing the number of memory references which must
be made to execute the instruction and (2) a reduced amount of
memory space required to hold the instruction. (If it requires
$k$ bits to encode a memory address and $j$ bits to encode the
instruction operation, then an instruction with $n$ explicit
addresses requires $nk + j$ bits of memory to hold it.)
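
The storage requirement $nk + j$ can be evaluated directly. In
the sketch below the address width k = 16 corresponds to a 64K
word memory, while the operation code width j = 6 is a
hypothetical value chosen only for illustration.

```python
def instruction_bits(n: int, k: int, j: int) -> int:
    """Bits needed for an instruction with n explicit k-bit addresses
    and a j-bit operation code."""
    return n * k + j


k = 16  # address bits for a 64K-word memory (2**16 = 65536 words)
j = 6   # hypothetical operation code width

four_address = instruction_bits(4, k, j)    # A, B, C and D all explicit
single_address = instruction_bits(1, k, j)  # one explicit operand address
```

Under these assumptions the fully explicit prototype needs 70
bits per instruction against 22 bits for an instruction with a
single explicit address.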


In real computers a common implementation of the basic
instruction is one called the single address format. A
register called the accumulator is implicitly specified as
both the location of one of the operands and the location of
the result. The next instruction address is implicitly taken
as the address of the current instruction plus one. The single
address format requires two memory references: one for the
instruction itself and one for the operand. The single address
format is used as a prototype instruction for the purposes of
the subsequent analysis and the computer instruction set is
assumed to be made up entirely of this type of instruction. In
addition, each instruction and each operand is assumed to
occupy exactly one memory word.
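
The single address format can be illustrated with a very small
interpreter that counts memory references. This is a sketch
only: the operation names (LOAD, ADD, SUB, STORE, HALT) and the
encoding of an instruction as an (operation, address) pair are
hypothetical and are not taken from the thesis.

```python
def run(memory):
    """Execute single-address instructions held in `memory`.

    Each instruction is an (operation, address) pair occupying one word
    and each operand occupies one word. Returns the final accumulator
    value and the number of memory references made.
    """
    acc = 0      # the implicitly specified accumulator
    pc = 0       # address of the current instruction
    refs = 0     # count of memory references
    while True:
        op, addr = memory[pc]          # reference 1: fetch the instruction
        refs += 1
        if op == "HALT":
            return acc, refs
        if op == "LOAD":
            acc = memory[addr]         # reference 2: fetch the operand
        elif op == "ADD":
            acc = acc + memory[addr]
        elif op == "SUB":
            acc = acc - memory[addr]
        elif op == "STORE":
            memory[addr] = acc         # reference 2: write the result
        else:
            raise ValueError(f"unknown operation {op!r}")
        refs += 1
        pc += 1   # next instruction address is implicit: current plus one


program = [
    ("LOAD", 5),    # acc <- (5)
    ("ADD", 6),     # acc <- acc + (6)
    ("STORE", 7),   # (7) <- acc
    ("HALT", 0),
    0,              # address 4: unused word
    2,              # address 5: operand
    3,              # address 6: operand
    0,              # address 7: result
]
acc, refs = run(program)
```

Every instruction other than the hypothetical HALT makes
exactly two memory references, one for the instruction and one
for the operand, which is the property used in the analysis
that follows.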

D.  Instruction  Timing  Diagram 

The execution of an instruction of the single address format
involves the following steps:

1. The instruction itself is fetched from the memory. The
memory address of the instruction is specified implicitly by
the address of the previous instruction plus one.

2. The instruction once received from memory is decoded,
yielding the operation to be performed and the address of the
operand to be used.

3. The operand is fetched from memory.

4. Once the operand is received by the processor the operation
is performed.

This sequence of events may be indicated diagrammatically by
plotting simultaneously against time the memory and processor
activities. See figure 3. This construction is termed an
instruction timing diagram (ITD). The ITD will be used to
visualize the instruction execution and (as will be seen in
chapter III) to compute the amount of time required to execute
the instruction. The general approach to be taken in computing
instruction execution times from the ITD is to pick
corresponding points on successive instructions and determine
the time between them. (Such points are indicated by A's on
the ITD of figure 3.) Sometimes the appropriate choice of
corresponding points facilitates the determination of the
instruction execution time.

[Figure 3 diagram not reproduced; its legend follows.]

Legend:

1 instruction fetch
2 instruction decoding
3 operand fetch
4 instruction execution
5 next instruction fetch
$t_a$ memory access time
$t_w$ memory restore time
$t_d$ instruction decode time
$t_{ei}$ processor execution time

Figure 3. Instruction Timing Diagram


If there is more than just one memory capable of simultaneous
operation the ITD is easily extended to handle this case. In
the following we have two memories $M_1$ and $M_2$; the
instruction reference goes to $M_1$ and the operand reference
to $M_2$.

[Instruction timing diagram for two memories not reproduced.]

The interesting thing to note here is that once the
instruction is decoded the operand reference can be
immediately made to $M_2$. The time required to execute the
instruction (assuming $t_d < t_w$) is shortened; this is a
result of the concurrent operation for a time of both
memories. Further discussion of this is deferred until chapter
III.


E.  The  Instruction  Execution  Rate 

The computer instruction set is assumed to consist of a set of
instructions $I_i$ of the single address instruction format,
each with an associated value of processor execution time
$t_{ei}$. The value of decode time $t_d$ is assumed to be
constant for all instructions. By examining the ITD we can
determine a time $t_i$ to execute $I_i$, where $t_i$ is
clearly a function of $t_{ei}$, $t_d$, $t_a$, $t_w$. In
section D we saw that the number of memories influenced the
instruction execution time; hence $t_i$ is also a function of
the memory structure. For the time being let us simply
associate the memory structure with a variable $S$. Then we
can write:

$$t_i = h(t_{ei}, t_d, t_a, t_w, S). \tag{2.1}$$

Associated with each instruction $I_i$ is a probability or
relative frequency $f_i$ which gives the likelihood that any
given instruction is of type $i$. The average time $E(t)$
required to execute an instruction is computed:

$$E(t) = \sum_i f_i t_i. \tag{2.2}$$

The instruction execution rate is the reciprocal of the
average execution time; hence,

$$\mathrm{IER} = 1 \Big/ \Big( \sum_i f_i t_i \Big). \tag{2.3}$$
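
Equations 2.2 and 2.3 amount to a weighted average followed by
a reciprocal, as the following sketch shows. The instruction
mix used here (relative frequencies and execution times in
μsec.) is hypothetical and serves only to exercise the
formulas.

```python
def average_execution_time(mix):
    """E(t) of equation 2.2; mix is a list of (f_i, t_i) pairs."""
    total_frequency = sum(f for f, _ in mix)
    if abs(total_frequency - 1.0) > 1e-9:
        raise ValueError("relative frequencies must sum to 1")
    return sum(f * t for f, t in mix)


def instruction_execution_rate(mix):
    """IER of equation 2.3: the reciprocal of the average execution time."""
    return 1.0 / average_execution_time(mix)


# Hypothetical mix: 70% of instructions take 2.0 usec, 30% take 4.0 usec.
mix = [(0.7, 2.0), (0.3, 4.0)]
e_t = average_execution_time(mix)        # average time in usec
ier = instruction_execution_rate(mix)    # instructions per usec
```

Note that the IER is the reciprocal of the average time, not
the average of the individual reciprocals; the two differ
whenever the execution times are not all equal.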
Chapter III Single Processor Computers
A.  Single  Memory 

Given that we have a single processor organization and a set
of instructions in the single address format with relative
frequencies $f_i$ and processor execution times $t_{ei}$, we
now wish to compute the IER. When there is a single memory the
ITD is:

[Instruction timing diagram for a single memory not reproduced.]

Since in chapter II we indicated that for most practical
situations $t_d < t_w$, the important timing relationship is
that between $t_{ei}$ and $t_w$. The instructions fall
naturally into two classes:

class 1 - $t_{ei} \le t_w$
class 2 - $t_{ei} > t_w$.

For instructions of class 1, the execution time for
instruction $I_i$ is:

$$t_i = 2t_a + 2t_w = 2t_c \tag{3.1}$$

where $t_c = t_a + t_w$ is the memory cycle time. For
instructions of class 2, the execution time is

$$t_i = 2t_a + t_w + t_{ei}. \tag{3.2}$$

The average execution time for an instruction is defined as in
chapter II:

$$E(t) = \sum_i f_i t_i = \sum_{i \in c1} f_i (2t_c) + \sum_{i \in c2} f_i (2t_a + t_w + t_{ei}) \tag{3.3}$$


where $i \in c1$ and $i \in c2$ mean those subscripts which
apply to classes one and two respectively. Now let us make the
following definitions:

$$f_1 = \sum_{i \in c1} f_i \tag{3.4a}$$

$$f_2 = \sum_{i \in c2} f_i \tag{3.4b}$$

$$t_{e1} = (1/f_1) \sum_{i \in c1} f_i t_{ei} \tag{3.4c}$$

$$t_{e2} = (1/f_2) \sum_{i \in c2} f_i t_{ei}. \tag{3.4d}$$

From these definitions, $f_1$ and $f_2$ are the relative
frequencies of all instructions of class 1 and class 2
respectively. Similarly, $t_{e1}$ and $t_{e2}$ are the average
respective processor execution times for instructions of class
1 and class 2. Substituting equations 3.4 in equation 3.3
gives:

$$E(t) = f_1 (2t_c) + f_2 (2t_a + t_w + t_{e2}). \tag{3.5}$$

The form of the result suggests a simple way to compute
$E(t)$. The average execution times ($t_{e1}$ and $t_{e2}$)
are substituted for $t_{ei}$ in the general expressions for
the instruction execution times given by equations 3.1 and 3.2
and then the resulting values are weighted by the respective
class relative frequencies. This approach works because of the
linearity of the averaging operation and applies not only to
the results of this section but also to those of the
subsequent sections as well. Hence we can disregard the detail
of the instruction set and in the subsequent analysis perform
only two computations analogous to those represented by
equations 3.1 and 3.2. The instruction execution rate is
obtained by taking the reciprocal of $E(t)$ as defined by
equation 3.5:

$$\mathrm{IER} = 1/(f_1 (2t_c) + f_2 (2t_a + t_w + t_{e2})). \tag{3.6}$$

Since $f_1 + f_2 = 1$ and $t_c = t_w + t_a$ the latter becomes:

$$\mathrm{IER} = 1/(2t_c + f_2 (t_{e2} - t_w)). \tag{3.7}$$
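
The equivalence of the instruction-by-instruction average and
the class form of equation 3.7 can be checked numerically. The
sketch below is illustrative only; the instruction mix and the
access and restore times (in μsec.) are hypothetical values.

```python
def instruction_time(t_a, t_w, t_e):
    """Execution time of one instruction with a single memory."""
    t_c = t_a + t_w
    if t_e <= t_w:
        return 2.0 * t_c               # class 1, equation 3.1
    return 2.0 * t_a + t_w + t_e       # class 2, equation 3.2


def ier_direct(mix, t_a, t_w):
    """IER from equation 2.3; mix is a list of (f_i, t_ei) pairs."""
    return 1.0 / sum(f * instruction_time(t_a, t_w, t_e) for f, t_e in mix)


def ier_by_class(mix, t_a, t_w):
    """IER from equation 3.7, using only the class 2 aggregates."""
    class2 = [(f, t_e) for f, t_e in mix if t_e > t_w]
    f2 = sum(f for f, _ in class2)
    t_e2 = sum(f * t_e for f, t_e in class2) / f2 if f2 > 0 else 0.0
    return 1.0 / (2.0 * (t_a + t_w) + f2 * (t_e2 - t_w))


mix = [(0.6, 0.2), (0.3, 1.5), (0.1, 3.0)]   # hypothetical (f_i, t_ei) pairs
t_a, t_w = 0.4, 0.6                          # hypothetical access, restore times
```

With these values the average execution time is 2.51 μsec. by
either route, and an instruction set consisting entirely of
class 1 instructions gives the limiting IER of $1/(2t_c)$.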

B.  Interleaved  Memory 

We can see from equation 3.7 that even making the processor
arbitrarily fast (which makes $f_2 = 0$) cannot provide an IER
greater than $1/(2t_c)$. The IER could be increased if it were
possible to have the instruction and its operand in different
memories. This cannot be done with certainty without greatly
reducing the memory utility,¹ but we can propose a memory
organization that achieves this with a high probability.
Suppose we have $m$ independent memories. We arrange the
memory addressing so that successive addresses are in
different memories. In particular if the memories are denoted
$M_0$, ..., $M_{m-1}$ and address 0 is in $M_0$ then address
$z$ is in $M_{z \bmod m}$. Such an addressing scheme tends to
uniformly distribute the operands and the instructions among
all the memories regardless of their particular addresses. We
can then make the reasonable assumption that the probability
of any particular memory reference being directed to any
particular memory is $1/m$. Equivalently, if a memory
reference goes to $M_j$ the probability that the succeeding
reference also goes to $M_j$ is $1/m$. The probability that
the succeeding reference does not go to $M_j$ is $1 - 1/m$.
This type of memory organization is called an interleaved
memory. The ITD for the interleaved memory case with no
addressing conflicts is:

[Instruction timing diagram for interleaved memory not reproduced.]

¹ The obvious way to do this is to have separate memories for
the instructions and the operands. This approach eliminates
the (little used) generality to use operands as instructions
and conversely. More importantly it eliminates the ability to
apportion freely the memory between operands and instructions
as the need arises.
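
The interleaved addressing rule described above, address $z$
in memory $M_{z \bmod m}$, can be sketched as follows. The
function name and the particular block of addresses examined
are illustrative choices only.

```python
from collections import Counter


def memory_of(address: int, m: int) -> int:
    """Index j of the memory M_j holding the given address under m-way interleaving."""
    return address % m


m = 4
first_addresses = [memory_of(z, m) for z in range(6)]

# Number of words of a block of 4096 consecutive addresses that fall in each memory.
load = Counter(memory_of(z, m) for z in range(1000, 1000 + 4096))
```

Successive addresses always fall in different memories, and
any long run of consecutive addresses is spread evenly over
the m memories. This is what motivates treating each reference
as going to a given memory with probability $1/m$.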


We note that the value of $t_i$ is no longer deterministic;
rather $t_i$ is a random variable. In the following we compute
the average value of $t_i$; however for simplicity there is no
new notation introduced to indicate that it is an average
value. We recall that the general idea in using the ITD to
compute instruction timing is to pick corresponding points on
successive instructions and then determine the time elapsed
between the points. On the above diagram the points used are
indicated by A's. The potential conflicts are indicated by the
numbers 1 and 2 on the ITD. The time elapsed between the first
A and point 1 is $t_a + t_d$. With probability $1/m$ a delay
of $t_w - t_d$ is encountered at this point. From point 1 to
point 2 a time $t_a + t_{ei}$ elapses. With probability $1/m$
a further delay of $t_w - t_{ei}$ is encountered before the
next instruction can begin. Hence:

$$t_i = t_a + t_d + (1/m)(t_w - t_d) + t_a + t_{ei} + (1/m)(t_w - t_{ei})$$

$$= 2t_a + t_d + t_{ei} + (1/m)(2t_w - t_d - t_{ei}). \tag{3.8}$$

We have implicitly assumed above that the instructions were
of class 1; for instructions of class 2 no delay can be
encountered at point 2. In a manner similar to the preceding
we find for instructions of class 2 that:

    t_i = 2t_a + t_d + t_{ei} + (1/m)(t_w - t_d).              (3.9)

E(t) is now found using the method of section A:

    E(t) = f_1(2t_a + t_d + t_{e1} + (1/m)(2t_w - t_d - t_{e1}))
           + f_2(2t_a + t_d + t_{e2} + (1/m)(t_w - t_d))

         = 2t_a + t_d + t_{em} + (1/m)(t_w - t_d)
           + (f_1/m)(t_w - t_{e1})                             (3.10)

where t_{em} is the average value of processor execution time
for all instructions:

    t_{em} = f_1 t_{e1} + f_2 t_{e2}.                          (3.11)

We now compute the IER:

    IER = 1/(2t_a + t_d + (1/m)(t_w - t_d) + t_{em}
             + (f_1/m)(t_w - t_{e1})).                         (3.12)

For a large value of m the IER becomes:

    IER \approx 1/(2t_a + t_d + t_{em}).                       (3.13)

Hence for an arbitrarily fast processor (t_d, t_{em} << t_a)
the maximum IER is about twice that obtainable with a
non-interleaved memory system.
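
As a numerical illustration of equations 3.11 to 3.13, the
following Python sketch evaluates the IER for a two-class
instruction mix. The function name, the argument names and the
sample timings are assumptions made for the illustration only;
class 1 is taken to be the class that can be delayed at point 2.

```python
def interleaved_ier(m, ta, tw, td, te1, te2, f1):
    """IER for an m-way interleaved memory, following equation 3.12.

    ta, tw : memory access time and recycle time
    td     : decode time
    te1    : execution time of class 1 instructions
    te2    : execution time of class 2 instructions
    f1     : relative frequency of class 1 (class 2 has 1 - f1)
    """
    f2 = 1.0 - f1
    tem = f1 * te1 + f2 * te2                  # equation 3.11
    expected_time = (2 * ta + td + tem
                     + (tw - td) / m           # conflict at point 1
                     + f1 * (tw - te1) / m)    # conflict at point 2
    return 1.0 / expected_time


# Hypothetical timings (arbitrary time units), chosen only for illustration.
params = dict(ta=0.5, tw=0.5, td=0.1, te1=0.2, te2=0.8, f1=0.7)
for m in (1, 2, 4, 16, 1024):
    print(m, round(interleaved_ier(m, **params), 4))
```

As m grows the printed values rise toward the limit of equation
3.13, since both conflict terms are divided by m.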

C.  Interleaved Memory - Alternative Analysis

In the last section we considered all memory
references to be random with the probability of a reference
to a particular memory 1/m. Hence, with this definition,
there is a non-zero probability 1/m^2 that an operand
reference conflicts with the instruction reference and that the
succeeding instruction reference conflicts with that operand
reference. This implies that two successive instructions
(which occupy successive memory locations) are located in
the same memory. But the interleaving scheme we have suggested
generally avoids this. It is of interest then to compute
E(t) if these double conflicts were eliminated. (The
double conflicts can occur only when and we shall
assume this for the following discussion.) Figure 4 shows
a  tree  structure  of  possible  operation  sequences.  This  is 
presented  to  aid  in  the  analysis.  The  probability  of  get¬ 
ting  from  one  node  of  the  tree  to  another  is  the  product 
of  the  probabilities  of  all  the  branches  connecting  the 
nodes.  The  probabilities  of  the  branches  are  obtained  as 
follows: 

1.  The probabilities of branches 1-2 and 1-3 are
just those normally associated with the conflict
of an operand reference with the preceding
instruction reference. Hence they are 1/m and
1 - 1/m respectively.

2.  From the preceding discussion branch 2-4
represents an impossible situation. Hence the
probability of branch 2-4 is zero and the
probability of branch 2-5 is one.

3.  The  probability  of  branch  3-6  is  not  (as  might 
be  expected)  1/m  but  rather  1/(m-1).  Once  an 
instruction  is  obtained  from  a  memory,  say  Mj  , 
and  the  operand  reference  is  known  not  to  conflict 
with  that  of  the  instruction,  the  operand  must 
have  been  chosen  from  one  of  the  remaining  memories 
not  including  Mj  .  Since  the  next  instruction 
reference  is  also  made  to  one  of  these  memories, 
the  probability  of  a  conflict  is  1/(m  -  1).  The 
probability  of  branch  3-7  is  then  1  -  1/(m  -  1). 


Figure 4.  Instruction Execution Tree
(p = probability of branch; the final level of branches is the
next instruction reference)

The times next to the terminal nodes indicate the instruction
times for the sequence ending at the node; they are
readily obtained from the ITD. We now compute t_i:

    t_i = \sum_{j=4,5,6,7} (time from node 1 to node j)
                           x (probability of getting from node 1 to node j)

        = (1/m)(t_a + t_w + t_a + t_{ei})
          + (1 - 1/m)(1/(m - 1))(t_a + t_d + t_a + t_w)
          + (1 - 1/m)(1 - 1/(m - 1))(t_a + t_d + t_a + t_{ei})

        = 2t_a + t_d + t_{ei} + (1/m)(2t_w - t_{ei} - t_d)     (3.14)


which  is  exactly  the  same  as  is  obtained  by  the  previous, 
simpler  analysis. 
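
The agreement between the tree analysis and the simpler
analysis can be checked with exact rational arithmetic. The
Python sketch below is only an illustration: the function names
and the integer timings are assumptions, and the path times are
those implied by the ITD (a conflict replaces the overlapped
processor interval by a full recycle time t_w).

```python
from fractions import Fraction


def tree_time(m, ta, tw, td, tei):
    """Expected instruction time summed over the terminal nodes of the tree."""
    p_operand = Fraction(1, m)       # branch 1-2: operand reference conflicts
    p_next = Fraction(1, m - 1)      # branch 3-6: next fetch conflicts
    return (p_operand * (ta + tw + ta + tei)
            + (1 - p_operand) * p_next * (ta + td + ta + tw)
            + (1 - p_operand) * (1 - p_next) * (ta + td + ta + tei))


def simple_time(m, ta, tw, td, tei):
    """Expected instruction time from the simpler analysis (equation 3.8)."""
    return 2 * ta + td + tei + Fraction(1, m) * (2 * tw - td - tei)


# Hypothetical integer timings (arbitrary units) with td < tw and tei < tw.
for m in range(2, 9):
    assert tree_time(m, 5, 5, 1, 2) == simple_time(m, 5, 5, 1, 2)
```

Using Fraction avoids rounding, so the equality test is exact for
every m tried rather than approximate.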


D.  Instruction  Buffering 

With  an  m-way  interleaved  memory  it  is  possible 
to  simultaneously  obtain  the  contents  of  m  successive  mem¬ 
ory  locations.  Since  successive  instructions  are  normally 
located  in  successive  memory  locations,  it  is  possible  to 
organize  the  processor  to  perform  the  instruction  references 
for  m  instructions  simultaneously  (the  current  instruction 
reference  and  the  next  m  -  1  instruction  references).  Let 
us  assume  that  the  m  instructions  obtained  are  stored  in  m 
fast processor registers with an access time t_r << t_a. The
instructions are then obtained from the registers as needed.
Now up to s ≤ m instructions (why s can be less than m is
discussed shortly) can be executed which have an ITD as
follows:


We see that (except for the first instruction) the fetching
and decoding operations of the subsequent instruction can
be grouped with the processor execution of the current
instruction. This makes it appropriate to redefine (for
this section only) classes 1 and 2:

    class 1:  t_{ei} + t_r + t_d < t_w
    class 2:  t_{ei} + t_r + t_d ≥ t_w.

Using the approach of section B the values of t_i are readily
obtained:

    t_i = t_a + (t_{ei} + t_r + t_d)
          + (1/m)(t_w - t_{ei} - t_r - t_d)                    (3.15)

for instructions of class 1 and

    t_i = t_a + t_r + t_d + t_{ei}                             (3.16)

for instructions of class 2. The time required to fetch
the m instructions is t_c because the operand reference of
the first instruction necessarily conflicts with the
instruction fetch. To compute E(t) for a single instruction, we
apportion the initial instruction fetch time among the s
instructions executed. Hence using equations 3.15 and 3.16
we compute E(t):

    E(t) = t_c/s + f_2(t_a + t_r + t_d + t_{e2})
           + f_1(t_a + t_{e1} + t_d + t_r
                 + (1/m)(t_w - t_{e1} - t_r - t_d))

         = t_c/s + (t_a + t_d + t_{em} + t_r)
           + (f_1/m)(t_w - t_{e1} - t_d - t_r).                (3.17)

The IER is then computed:

    IER = 1/(t_c/s + (t_a + t_d + t_{em} + t_r)
             + (f_1/m)(t_w - t_{e1} - t_d - t_r)).             (3.18)


The reason that fewer than m instructions can be
executed is the presence of branch instructions in
the instruction stream. Such instructions cause the
processor to take the next instruction from a memory location
that is not the next successive memory location after the
branch instruction. Suppose that the relative frequency
of branch instructions is f_b. We shall characterize the
instruction set by assuming that there is a constant
probability p_b = f_b that any given instruction is a branch
instruction. Hence there is a probability 1 - p_b that an
instruction is not a branch instruction. We now compute
the probability that a sequence of k instructions is
executed. For 1 ≤ k < m there must be k - 1 instructions that
are not branches followed by a branch instruction. Let X
be a random variable equal to the number of instructions
executed. From the preceding discussion:

    p(X = k) = p(k) = (1 - p_b)^{k-1} p_b,   k = 1,...,m - 1;  (3.19)

where p(k) is the probability of the k instruction
sequence. Regardless of whether the m-th instruction is a
branch instruction or not, the execution sequence terminates
at the m-th instruction if it has not terminated earlier.
Thus:

    p(m) = 1 - p(X < m)
         = 1 - \sum_{k=1}^{m-1} (1 - p_b)^{k-1} p_b.           (3.20)

Using the fact that the sum of a finite geometric series
is:

    \sum_{k=1}^{n} a^k = (a - a^{n+1})/(1 - a)

we reduce equation 3.20 to

    p(m) = (1 - p_b)^{m-1}.                                    (3.21)

The expected number of instructions executed E(X) is then:

    E(X) = \sum_{k=1}^{m} k p(k)
         = \sum_{k=1}^{m-1} k p_b (1 - p_b)^{k-1}
           + m(1 - p_b)^{m-1}.                                 (3.22)

Writing the summation as

    p_b (d/d(1 - p_b)) \sum_{k=1}^{m-1} (1 - p_b)^k

and using the previously mentioned relation for the sum of
a finite geometric series, we reduce equation 3.22 to

    E(X) = (1 - (1 - p_b)^m)/p_b.                              (3.23)

The expression for E(X) can now be substituted for s in
equation 3.18 yielding the IER for the buffered instruction
case:

    IER = 1/( p_b t_c/(1 - (1 - p_b)^m) + t_a + t_d + t_r + t_{em}
              + (f_1/m)(t_w - t_{e1} - t_d - t_r) ).           (3.24)

For large values of m and small values of t_d and t_r,
equation 3.24 becomes:

    IER \approx 1/(p_b t_c + t_a + t_{em}).                    (3.25)

Finally, for a fast processor and a low value of p_b (for
scientific computing p_b probably lies in the range of about
0.05 to 0.3) the IER approaches 1/t_a, which is about twice
the IER in the results of section B.
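
The closed form for E(X) can be compared with the direct
summation, and the buffered IER evaluated, with a short Python
sketch. This is an illustration under stated assumptions: the
function and argument names are invented here, and t_c is taken
as t_a + t_w.

```python
def expected_run_length(m, pb):
    """Closed form for E(X), equation 3.23 (requires pb > 0)."""
    return (1.0 - (1.0 - pb) ** m) / pb


def expected_run_length_by_sum(m, pb):
    """Direct evaluation of the sum in equation 3.22."""
    total = sum(k * pb * (1.0 - pb) ** (k - 1) for k in range(1, m))
    return total + m * (1.0 - pb) ** (m - 1)


def buffered_ier(m, pb, ta, tw, td, tr, tem, te1, f1):
    """IER with instruction buffering, equation 3.24."""
    tc = ta + tw                        # memory cycle time (assumed ta + tw)
    s = expected_run_length(m, pb)      # E(X) substituted for s
    return 1.0 / (tc / s + ta + td + tr + tem
                  + f1 * (tw - te1 - td - tr) / m)


for m in (1, 2, 8, 32):
    for pb in (0.05, 0.3):
        gap = expected_run_length(m, pb) - expected_run_length_by_sum(m, pb)
        assert abs(gap) < 1e-9
```

For m = 1 both forms give exactly one instruction per fetch, which
is a convenient sanity check on the indexing of the sum.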

E.  Instruction Prefetch


Normally  computer  instructions  are  intended  to 
be  sequentially  executed:  in  a  stream  of  instructions,  the 
execution  of  instruction  x  +  1  does  not  begin  until  the 
execution  of  instruction  x  is  completed.  However,  it  is 
possible  to  organize  the  processor  so  that  more  than  one 
instruction  is  being  executed  at  a  time.  It  is  possible 
to go to processors of considerable complexity (as for
example in the CDC 6600 and the IBM 360/91) in order to
maximize the overlapping of instruction execution. In this
section, however, we will discuss a modest form of
concurrency of instruction execution: the instruction
prefetch. The idea of the instruction prefetch is quite simple:
the overlapping of the subsequent instruction fetch with
the processor execution time of the current instruction.

The ITD for this case with no addressing conflicts is:


[ITD for the instruction prefetch case; marked points: "next
instruction begins", "instruction ends", A]


From the ITD we observe that the value of t_{ei} is not going
to appear in the expression for t_i. This perhaps
surprising result is a general feature of this type of
concurrency. The rate at which instructions are executed is
dependent on the time which elapses between instructions
commencing execution and not on the time required to
execute a given instruction. There are, however, some side
effects to be considered. If the value of t_{ei} is such
that it extends to overlap the processor execution time of
the next instruction, then multiple processor execution
units must be provided. Also there must be checking to
ensure that the result of the first instruction is not an
operand of the subsequent instruction.

Since t_{ei} is not going to appear in the final
relations we do not have to consider the individual values
of t_i and we can compute E(t) directly. Probably the best
way to find E(t) is to use the instruction execution tree of
section C. The probabilities associated with the branches of
the tree are the same; only the instruction execution times
have to be changed. The tree is presented in figure 5. We
now, using the approach of section C, compute E(t):

    E(t) = (1/m)(t_w + t_w + t_a)
           + (1 - 1/m)(t_a + t_d + t_a + t_w)
           + (1 - 1/m)(t_a + t_d + t_a)

         = 2t_a + t_d + (1/m)(2t_w + t_a - t_d).               (3.26)

The IER can now be computed:

    IER = 1/(2t_a + t_d + (1/m)(2t_w + t_a - t_d))             (3.27)
which for small t_d and large m goes to 1/(2t_a).


Chapter IV  Multiprocessor Computers
A.  The  Multiprocessor  Problem 

One of the main reasons for the analysis presented
in chapter III is to indicate some limits on the IER
obtainable in a single processor computer. To get a higher
IER  in  a  single  processor  computer  (single  instruction 
stream)  it  is  probably  necessary  to  have  a  definite  struc¬ 
ture  in  the  information  to  be  processed.  For  example,  if 
the  information  can  be  structured  as  n-component  vectors 
and  we  organize  the  processor  so  as  to  have  n  execution 
units  which  are  capable  of  performing  simultaneous  oper¬ 
ations  on  each  of  the  n  components  (and  provide  a  suitable 
memory  organization),  then  we  can  obtain  an  IER  which  is 
about  n  times  that  which  would  be  obtained  if  the  data  were 
treated  in  a  scalar  form.  This  is  essentially  the  approach 
taken  in  the  Illiac  IV  (Barnes,  et  al.,  1968).  As  might  be 
expected  there  are  considerable  difficulties  in  realising 
an  IER  that  high  for  many  practical  problems. 

If  the  information  to  be  processed  cannot  be  so 
structured,  then  to  get  a  higher  IER,  it  is  necessary  to  go 
to  a  multiprocessor  organization  (with  multiple  instruction 
streams).  We  should  note  at  this  point  that  it  is  not  the 
purpose  of  this  thesis  to  indicate  how  the  multiprocessor 
is to be used: in particular, how a single, inherently
sequential task can be broken down into n tasks that can be
run on an n-processor computer. For a discussion of this
see Rosenfeld (1969). A typical multiprocessor organization
is presented in figure 1. The most important aspect of
the  multiprocessor  organization  is  the  sharing  of  a  common 
memory  system  by  all  the  processors.  As  the  processors 
randomly  direct  requests  to  the  memories,  it  is  inevitable 
that  conflicts  will  arise  in  that  a  processor  will  request 
service  from  a  memory  that  is  busy  servicing  another  pro¬ 
cessor  request.  The  function  of  the  switch  in  figure  1  is 
to  direct  processor  requests  to  the  correct  memory  and  to 
resolve  conflicts  by  deferring  requests  to  busy  memories 
to subsequent memory cycles. Since we assume that the
processor requests to the memories are random, we have what is
termed a stochastic service system. The study of such
systems is called queueing theory and in queueing theory
terminology the multiprocessor system is an m-server system
with a finite service requesting population (the n
processors). The servers are unique in the sense that they can
handle only requests directed specifically toward them.
(Usually an m-server system is taken to be one in which any
server can service any request.) The memories are
characterized by constant service time t_a followed by an
interval t_w when they are unavailable to service requests. New
requests for service are generated by processors after some
interval (t_d or t_{ei}) has elapsed since their last request
was serviced. These combined aspects of the multiprocessor
do not allow it to be handled by the common models of
queueing  theory.  It  does  not  appear  to  the  author  that  a 
rigorous  solution  of  this  queueing  situation  can  be  readily 
obtained.  Given  this,  there  are  basically  two  approaches 
that  can  be  taken:  (1)  simplify  the  model  sufficiently  so 
that  it  can  be  solved  by  rigorous  methods  or  (2)  attempt 
an approximate solution. The latter approach is taken in
this thesis; the analysis appears in the subsequent
sections of this chapter. The former approach is taken by
Skinner and Asher (1969). They model the multiprocessor
as  a  discrete  Markov  chain.  The  basic  time  interval  is  a 
memory  cycle  time.  They  assume  a  matrix  of  probabilities 
which  express  the  likelihood  that  a  given  processor  requests 
service  from  a  given  memory  at  the  beginning  of  the  memory 
cycle.  They  also  assume  matrices  of  probabilities  which 
express  the  likelihood  of  the  various  outcomes  that  can 
arise  when  there  are  simultaneous  requests  to  one  memory 
by  several  processors.  The  states  of  the  modelled  system 
are  characterized  by  the  processors  delayed  and  the  memories 
for  which  they  are  delayed.  A  state  transition  matrix 
is  formed  from  the  previously  mentioned  probabilities  and 
from  this  matrix  the  steady-state  probabilities  of  the 
various  states  are  determined.  With  this  information  the 
average  amount  of  delay  experienced  by  a  processor  in 
making  a  memory  request  is  computed. 

There are two problems with this approach. The first
is that as the number of memories and processors increases,
the  number  of  potential  states  of  the  system  becomes  quite 
large  and  it  is  difficult  to  obtain  other  than  a  numerical 
solution.  The  second  problem  is  obtaining  the  required  pro¬ 
babilities.  The  probability  that  a  processor  directs  a 
request  to  memory  during  a  given  time  interval  is  dependent 
not only on the relation of the memory speed to the processor
speed but also on the amount of delay a processor
experiences in getting a memory request serviced. Since, in
essence,  that  delay  is  what  the  analysis  is  supposed  to 
determine,  it  is  difficult  to  see  how  the  required  proba¬ 
bilities  can  be  obtained  in  an  analytic  manner.  (Skinner 
and  Asher  obtain  the  probabilities  that  they  use  in  their 
model  by  first  simulating  the  system  and  then  making  measure¬ 
ments  on  the  simulated  system.  The  necessity  of  doing 
this  would  seem  to  diminish  the  utility  of  the  model.) 

B.  Modified  Instruction  Format 

The  previous  discussion  has  assumed  the  single 
address  format  instruction  as  the  model  of  a  computer 
instruction.  This  instruction  format  has  an  ITD  as  follows: 


The instruction consists of two instances of the following
operation sequence: the accessing of memory followed by an
interval of processor activity. Hence an execution of a
single address instruction can be approximated as two
successive executions of a simple instruction with the
following ITD:

[ITD for the simple instruction: a memory access (M) followed
by processor activity (P)]

where t_{pi}, the average processor activity time, is defined:

    t_{pi} = (t_d + t_{ei})/2.                                 (4.1)

This instruction is termed a unit instruction. The
execution rate for unit instructions is termed UER to
distinguish it from the IER. For the situation here UER = 2 x
IER. We will now average over the instruction set and
compute a single value t_p defined:

    t_p = \sum_i f_i t_{pi}.                                   (4.2)

We will henceforth assume that all the instructions of the
instruction set are made up of unit instructions with a
single value t_p. As the analysis of chapter III would
suggest, the important relation to consider is that
between t_p and t_w. There are three cases of interest and
they are discussed individually in the following sections:

1.  t_p = t_w

2.  t_p < t_w

3.  t_p > t_w

To mention the unit instruction only in relation
to the single address format instruction is to overlook its
considerable generality. Obviously, instructions with no
operand reference map directly into unit instructions, but
the operation sequence of the unit instruction is
sufficiently basic - a memory access followed by processor
activity - that nearly any instruction format can be easily
mapped into a series of them. For example, consider a two
(operand) address instruction format which has the following
ITD:

[ITD for the two address format; labelled intervals: operand 1
fetch, operand 2 fetch]

This is mapped into three unit instructions each with an
average processor activity time t_p = (t_d + t_{ei})/3. Since
three unit instructions are required the UER = 3 x IER.

Other  instruction  formats  may  be  handled  in  a  similar 
manner. 
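
The mapping of an instruction format onto unit instructions can
be sketched in a few lines of Python. The function names, the
parameter `references` (the number of memory references per
instruction) and the sample instruction mix are assumptions made
for this illustration.

```python
def unit_instruction_tp(td, mix, references=2):
    """Average processor activity time t_p per unit instruction.

    td         : decode time
    mix        : list of (f_i, t_ei) pairs whose frequencies sum to 1
    references : memory references per instruction
                 (2 for single address, 3 for two address format)
    """
    return sum(f * (td + tei) / references for f, tei in mix)


def ier_from_uer(uer, references=2):
    """Each instruction consists of `references` unit instructions."""
    return uer / references


# Hypothetical instruction mix: 70% with t_ei = 0.2, 30% with t_ei = 0.8.
mix = [(0.7, 0.2), (0.3, 0.8)]
print(unit_instruction_tp(0.1, mix))                 # single address format
print(unit_instruction_tp(0.1, mix, references=3))   # two address format
```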

C.  Multiprocessor with t_p = t_w

In order to facilitate the discussion a further
change in the instruction format is indicated. Unlike the
change in section B the following is purely a conceptual
transformation which introduces no approximation in the
analysis. The ITD for the unit instruction when t_p = t_w is:

The memory now has an access time of t_c and zero recycle
time; the processor execution time is also zero. This
transformation introduces no change in the sense that the
performance of a system with either ITD is the same. For
both ITD's the memory access begins at point 1 and the
processor execution and the memory recycling are completed at
point 2. With this new instruction format we are now ready
to determine the UER for the multiprocessor.

Let us assume that we have m memories (m-way
interleaved) denoted M_j; j = 1,...,m; and n processors denoted
P_i; i = 1,...,n. From the instruction format, we can see
that one unit instruction is executed for each memory cycle
which is utilized by a processor. The maximum rate at which
memory cycles are available is m/t_c and this represents
an upper bound on the UER. The problem of finding the UER
reduces, in essence, to that of determining that fraction
of the total number of memory cycles which are utilized by
all the processors. Notice that this represents quite a
different approach from that used in chapter III. The
analysis in chapter III might be termed processor oriented
since  the  time  required  to  access  memory  is  simply  considered 
a  delay  which  is  added  to  the  processor  execution  times  in 
order  to  get  the  total  instruction  execution  time.  The 
analysis of this chapter is memory oriented since the
processors are considered only to the extent that they affect
the memory utilization.

We will term a processor queued if it is either
waiting for or in the process of receiving memory service.
A memory is termed occupied if it has one or more processors
queued and unoccupied if it has no processors queued. A
processor is termed active if it is currently being
serviced by a memory. Let us consider an interval of time
equal to t_c. For each memory which is occupied at the
beginning of the interval, there is exactly one memory
request serviced during that interval and hence exactly
one unit instruction executed. For each memory that is
unoccupied at the beginning of the interval there are no
memory requests serviced (and hence no unit instructions
executed). Because the modified instruction format has
t_p = 0 there are always n processors queued. Let us now
define a random variable Z_j; j = 1,...,m; where:

    Z_j = 0 if M_j is unoccupied
    Z_j = 1 if M_j is occupied.                                (4.3)

If X is a random variable which takes on values equal to
the number of occupied memories, then:

    X = \sum_j Z_j.                                            (4.4)

The expected value of X, E(X), is the average number of
occupied memories. From the previous discussion it should
be clear that:

    UER = E(X)/t_c.                                            (4.5)

From equation 4.4 we have:

    E(X) = E(\sum_j Z_j) = \sum_j E(Z_j)                       (4.6)

where E(Z_j) is the expected value of Z_j. Since all the
memories are identical, equation 4.6 reduces to:

    E(X) = m E(Z_j) for any j.                                 (4.7)

We now wish to focus on one memory M_j and determine Z_j.
The approach used here is related to the occupancy (or
distribution) problems of combinatorial analysis (Feller,
1968). From the foregoing discussion we know that there
are always n processors queued. The probability of any
given processor memory request going to any given memory
and hence queued for that memory is, as in chapter III,
1/m. In particular, the probability that any given processor
is queued for M_j is 1/m and the probability that any given
processor is not queued for M_j is 1 - 1/m. If Y is a random
variable equal to the number of processors queued for M_j,
the probability that Y = r is given by a binomial
distribution:

    p(Y = r) = p(r) = \binom{n}{r} (1/m)^r (1 - 1/m)^{n-r}.    (4.8)

From the definition of Z_j and Y we now compute E(Z_j):

    E(Z_j) = (0)p(0) + \sum_{r=1}^{n} (1)p(r)
           = \sum_{r=0}^{n} p(r) - p(0)
           = 1 - p(0).                                         (4.9)

Using equation 4.8 in equation 4.9 we find:

    E(Z_j) = 1 - (1 - 1/m)^n.                                  (4.10)

We then use equations 4.5 and 4.10 to compute the UER:

    UER = (m/t_c)(1 - (1 - 1/m)^n).                            (4.11)

E(X), the average number of occupied memories, is a function
of m and n; let us call this function g(m,n):

    g(m,n) = m(1 - (1 - 1/m)^n).                               (4.12)

The  function  g(m,n)  has  certain  properties  of  interest: 

1.  For  m,n>1,  g(m,n)  is  monotonically  increasing 
in  m  and  n.  This  shows  that  we  always  get  an 
improvement  in  the  UER  by  adding  another  memory 
or  processor. 

2.  g(m,n) ≤ minimum(m,n). The number of unit
instructions executed during an interval t_c
cannot exceed the number of memories or processors.
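
The function g(m,n) and the two properties listed above can be
checked numerically. The Python sketch below is an illustration
only; the function names are assumptions, and the second function
rebuilds g(m,n) from the binomial distribution of equation 4.8 as
a cross-check on equations 4.9 and 4.10.

```python
from math import comb


def g(m, n):
    """Average number of occupied memories, equation 4.12."""
    return m * (1.0 - (1.0 - 1.0 / m) ** n)


def g_from_binomial(m, n):
    """m * (1 - p(0)), with p(r) the binomial distribution of equation 4.8."""
    p = [comb(n, r) * (1.0 / m) ** r * (1.0 - 1.0 / m) ** (n - r)
         for r in range(n + 1)]
    return m * sum(p[1:])


def uer(m, n, tc):
    """Unit instruction execution rate, equation 4.11."""
    return g(m, n) / tc


for m in range(1, 9):
    for n in range(1, 9):
        assert abs(g(m, n) - g_from_binomial(m, n)) < 1e-9
        assert g(m, n) <= min(m, n) + 1e-9            # property 2
        if m > 1 and n > 1:
            assert g(m + 1, n) > g(m, n)              # property 1, in m
            assert g(m, n + 1) > g(m, n)              # property 1, in n
```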

We might have stated the problem of finding the
UER as follows. Let us randomly distribute n processors
among m memories. The UER is the average number of memories
which receive processors multiplied by 1/t_c. Riordan (1953)
shows by quite different methods than we have employed
that the average number of memories which would receive
processors is g(m,n). (Riordan's work is in combinatorial
analysis; he speaks of balls and cells rather than processors
and memories.) This method of problem formulation shows
the approximate nature of the analysis. It has been
implicitly assumed that all n processors make random requests
during each interval of time t_c. In a real computer not all n
processors make requests during t_c; if there are several
processors queued for service at a memory, only the one
serviced during the interval t_c makes a new request at the
end of that interval. Consequently, unfavorable (in terms
of the effect on the UER) distributions of processors (a
number of processors queued for one memory) tend to be more
frequent in an actual computer than would be suggested by
the analysis. The result of this is that the UER specified
by equation 4.11 is somewhat higher than would be actually
observed. We might expect the deviation between the actual
and the computed UER to be greatest when there is a high
probability of a number of processors being queued for a
single memory. This would occur when n/m > 1 and m is small.
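
The difference between the idealized occupancy model and the
behavior just described can be seen in a small Monte Carlo
experiment. The Python sketch below is a minimal illustration and
is not the simulator of chapter V; the function names, the seed
and the cycle count are assumptions. In each interval t_c every
occupied memory serves one queued processor, and only the served
processors draw a new random memory.

```python
import random


def g(m, n):
    """Average number of occupied memories predicted by equation 4.12."""
    return m * (1.0 - (1.0 - 1.0 / m) ** n)


def simulate_busy_memories(m, n, cycles=20000, seed=1):
    """Average number of occupied memories per interval t_c (t_p = 0)."""
    rng = random.Random(seed)
    target = [rng.randrange(m) for _ in range(n)]   # memory each processor waits on
    total = 0
    for _ in range(cycles):
        served = {}
        for proc, mem in enumerate(target):
            served.setdefault(mem, proc)            # one processor served per memory
        total += len(served)                        # occupied memories this cycle
        for proc in served.values():
            target[proc] = rng.randrange(m)         # only served processors re-request
    return total / cycles


for m, n in ((2, 2), (4, 4), (2, 8)):
    print(m, n, round(simulate_busy_memories(m, n), 3), round(g(m, n), 3))
```

Multiplying either figure by 1/t_c gives the corresponding UER, so
the comparison of occupied-memory counts is a comparison of the
simulated and the computed UER.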

D.  Multiprocessor with t_p < t_w

When t_p < t_w the ITD is:

[ITD for the unit instruction: memory (M) and processor (P)
rows with points 1, 2 and 3 marked]

As in section C we perform a transformation on the ITD to
get:

The access time becomes t'_a = t_a + t_p, the memory restore
time becomes t'_w = t_w - t_p and the processor execution time
t_p goes to zero. The transformation is such that the
performance of a system with either ITD is the same. For both
ITD's the memory access starts at point 1, the processor
execution is through at point 2, and memory restore is
completed at point 3.

We  recall  the  definition  of  an  active  processor 
as  one  whose  memory  request  is  currently  being  serviced. 
When that service is completed, an active processor can
make  a  new  request  to  either  an  occupied  or  an  unoccupied 
memory.  If  it  makes  a  request  to  an  occupied  memory,  there 
is  no  appreciable  advantage  gained  from  the  fact  that  tp< 
tw;  the  processor  must  wait  anyway.  On  the  other  hand, 
if  the  request  is  made  to  an  unoccupied  memory  the  proces¬ 
sor's  request  is  serviced  immediately;  and  there  is  an  ad¬ 
vantage  associated  with  tp  being  less  than  tw.  The  proba- 
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bility  of  an  active  processor  making  a  request  to  an  oc¬ 
cupied  memory  is  defined  as: 

p(occ)  -  average  number  of  occupied  memories 


(4.13) 

The  probability  of  a  request  to  an  unoccupied  memory  is: 

p(unocc)  =  1  -  p(occ).  (4.14) 

\7o  estimate  the  number  of  occupied  memories  as  g(m,n).  Hence: 
p(occ)  =  1  -  (1  -  1/m)n  (4.15) 


p(unocc)  =  (1  -  1/m)n. 


(4.16) 


Using  the  ideas  of  chapter  III,  we  compute  the  average 
amount  of  time  required  to  execute  a  unit  instruction  by 
an  active  processor: 

E(  t )  =  p(occ)(t,J  +  p(unocc)(t^) 

■  p(occ)(tc)  +  p(unocc)(ta  ♦  tp) 

=  tc  ♦  (1  -  1/ra)n(tp  -  tw).  (4.17) 

The  rate  of  execution  R  is  just  1/E(t): 


1  ♦  (1  -  1/m)n 


(4.18) 


Now this is the unit instruction execution rate for one of
the active processors. The UER for the multiprocessor is
just R multiplied by the number of active processors, which
is also estimated as g(m,n). Thus:

$$\mathrm{UER} = \frac{m}{t_c}\cdot\frac{1 - (1 - 1/m)^n}{1 + (1 - 1/m)^n\,\dfrac{t_p - t_w}{t_c}}. \qquad (4.19)$$

Since the denominator of the fraction is less than one,
the UER is greater for the case of tp < tw than for the
case of tp = tw when m and n are the same.

We have used g(m,n) as an estimate of both the
average number of active processors and occupied memories.
Actually for tp < tw, the average number would be somewhat
higher than g(m,n). The increased number of occupied
memories would tend to decrease the performance (since there
is a reduced probability of a request to an unoccupied memory)
while the increased number of active processors would tend
to increase it. Simulation studies suggest that the effects
almost cancel (chapter V).
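
The estimate of equation 4.19 is easy to evaluate numerically. The following Python fragment is only a sketch under the assumptions of this section (tp not greater than tw, and g(m,n) used for both the active processors and the occupied memories); the function names are illustrative and do not come from the thesis.

```python
def g(m: int, n: int) -> float:
    """Occupancy estimate g(m, n) = m(1 - (1 - 1/m)^n), equation 4.12."""
    return m * (1.0 - (1.0 - 1.0 / m) ** n)


def uer_fast_processor(m: int, n: int, tc: float, tp: float, tw: float) -> float:
    """UER estimate of equation 4.19, meant for the case tp <= tw."""
    if tp > tw:
        raise ValueError("equation 4.19 applies only when tp <= tw")
    p_unocc = (1.0 - 1.0 / m) ** n             # equation 4.16
    expected_time = tc + p_unocc * (tp - tw)   # equation 4.17
    rate_per_active = 1.0 / expected_time      # equation 4.18
    return g(m, n) * rate_per_active           # equation 4.19


if __name__ == "__main__":
    # Times are expressed in units of the memory cycle time tc.
    tc, tw = 1.0, 0.5
    for tp in (0.5, 0.2, 0.1):
        print(f"tp = {tp}: UER = {uer_fast_processor(4, 4, tc, tp, tw):.4f} per tc")
```

With tp = tw the function reduces to g(m,n)/tc, and the estimate grows as tp is made smaller, as stated above.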


E. Multiprocessor with tp > tw


When tp > tw the following ITD applies:

[ITD for tp > tw: diagram not reproduced]


This ITD can be transformed to the following:

[Transformed ITD: diagram not reproduced]

The memory access time becomes tc, the memory recycle time
goes to zero, and the processor time goes to tp' = tp - tw.
Again the transformation is such that the performance of
a system with either ITD is the same. For both ITD's
the memory access begins at point 1, the memory is restored
at point 2, and the processor execution is completed at
point 3.


For the previous two cases there were always n
processors queued. For this case, because tp' > 0, there
will be, in general, fewer than n processors queued. This
introduces an additional complication into the analysis.

Let us suppose that there is a constant (in other words,
independent of time and the state of the memory queues)
probability pm that a given processor is queued for memory
service. Let Q be a random variable which takes on values
equal to the number of processors so queued. The
probability that k processors out of n are queued is given by
a binomial distribution:

$$P(Q = k) = p(k) = \binom{n}{k}\,p_m^k\,(1 - p_m)^{n-k}. \qquad (4.20)$$

When k processors are queued the average rate at which unit
instructions are executed is given by equation 4.11 with
n replaced by k. Defining this as R(k) we have:

$$R(k) = \frac{m}{t_c}\left(1 - (1 - 1/m)^k\right). \qquad (4.21)$$

The non-zero value of tp' does not affect in any direct way
the rate at which instructions are executed; however its
effect is felt indirectly through its influence on the
value of pm. We now compute the UER as the expected
value of R. From equations 4.20 and 4.21 we have:

$$\mathrm{UER} = \sum_k R(k)\,p(k)$$

$$= \sum_k \frac{m}{t_c}\binom{n}{k}\left(1 - (1 - 1/m)^k\right) p_m^k (1 - p_m)^{n-k}$$

$$= \frac{m}{t_c}\left(1 - \sum_k \binom{n}{k}\bigl(p_m(1 - 1/m)\bigr)^k (1 - p_m)^{n-k}\right)$$

$$= \frac{m}{t_c}\left(1 - (1 - p_m/m)^n\right). \qquad (4.22)$$

(The last result is from applying the binomial theorem to
the summation.) We note the UER specified by equation 4.22
is a function of pm. Equation 4.22 is identical to equation
4.11 except for the replacement of 1/m with pm/m. Since
pm ≤ 1 the UER for tp = tw is greater than the UER for tp > tw
assuming the same values for m and n. Because the number
of processors queued is binomially distributed, the average
number of processors queued is n pm. Hence:

$$p_m = \frac{\text{average number of processors queued}}{n}. \qquad (4.23)$$
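
The binomial-theorem step that leads to equation 4.22 can be checked numerically. The Python sketch below compares the term-by-term expectation of R(k) over the distribution of equation 4.20 with the closed form; the function names and the sample parameter values are illustrative assumptions only.

```python
from math import comb


def uer_closed_form(m: int, n: int, tc: float, pm: float) -> float:
    """Closed form of equation 4.22."""
    return (m / tc) * (1.0 - (1.0 - pm / m) ** n)


def uer_by_summation(m: int, n: int, tc: float, pm: float) -> float:
    """Expected value of R(k) over the binomial distribution of equation 4.20."""
    total = 0.0
    for k in range(n + 1):
        p_k = comb(n, k) * pm ** k * (1.0 - pm) ** (n - k)   # equation 4.20
        r_k = (m / tc) * (1.0 - (1.0 - 1.0 / m) ** k)         # equation 4.21
        total += r_k * p_k
    return total


if __name__ == "__main__":
    m, n, tc, pm = 4, 6, 1.0, 0.37   # arbitrary sample values
    print(uer_by_summation(m, n, tc, pm), uer_closed_form(m, n, tc, pm))
```

The two values agree to within floating-point rounding for any pm between zero and one.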

A flow diagram of the instruction execution is as follows:

[Flow diagram not reproduced]

Serviced memory requests leave the memory system at a rate
specified by equation 4.22. They then experience a delay
tp' before making a new memory request. Let np be the
average number of processors not queued in the memory
system and nm the average number queued. Necessarily:

$$n_p + n_m = n \qquad (4.24)$$

and thus substituting in equation 4.23 we have:

$$p_m = (n - n_p)/n. \qquad (4.25)$$

From the flow diagram the average number of processors
not queued must be the product of the average unit
instruction processing rate and the delay tp'. Thus:

$$n_p = \mathrm{UER}(p_m)\,t_p'. \qquad (4.26)$$

Using equation 4.26 and the relation tp' = tp - tw we have:

$$p_m = 1 - \frac{m}{n}\cdot\frac{t_p - t_w}{t_c}\left(1 - (1 - p_m/m)^n\right) \qquad (4.27)$$

or

$$0 = p_m + \frac{m}{n}\cdot\frac{t_p - t_w}{t_c}\left(1 - (1 - p_m/m)^n\right) - 1 \qquad (4.28)$$

which is an nth order polynomial equation in pm. It must
be solved for the value of pm in the interval (0,1). That
there exists one and only one solution of equation 4.28
in (0,1) can be seen by considering equation 4.27. As pm
goes from zero to one, the left hand side of equation 4.27
increases monotonically from zero to one while the right
hand side decreases monotonically from one. There is one
and only one value of pm in (0,1) for which the right and
left hand sides are equal. Once a value of pm is obtained,
it is substituted in equation 4.22 to obtain the UER.

Chapter V Simulation

A.  Reasons  for  Simulation 

In the multiprocessor analysis of chapter IV
two principal approximations are made. The first is the
replacement of the single address format instruction with
two successive unit instructions and the associated
averaging over the instruction set to get a value of tp. The
second approximation is the treating of the inherent
multiprocessor queueing problem as a distribution or occupancy
problem. The intent of the simulation studies is to
ascertain the effects of the approximations over a limited set
of cases.

B.  The  Simulator 

The simulator is of the "next most imminent event"
type. In the simulator certain rules are applied to
determine the sequence of events in the simulated system and the
timing of the events is determined accordingly. A simulator
of this type is quite simple and executes rather rapidly.

The simulator is set up to handle n processors
denoted Pi, i = 1,...,n, and m memories denoted Mj,
j = 1,...,m, where m and n are arbitrary and specified at
run time. Associated with each memory is a time tmj
which is the earliest time Mj can initiate servicing a new
memory request. Similarly, associated with each processor
is a time tpi which is the earliest time processor Pi
can initiate a new request for memory service. The
simulator is arranged so that one cycle of simulation corresponds
to the execution of one unit instruction by one processor.


There are two basic rules which govern the sequence of events
in the simulator. First, the unit instruction execution of
any given simulation cycle is always associated with the
processor Pi for which the value of tpi is a minimum at the
beginning of the cycle. (If there is more than one value of
i for which tpi is a minimum, then the largest value of i
is arbitrarily chosen.) Second, a unit instruction execution
involving Pi and Mj always commences at the maximum of the
times tpi and tmj since that is the earliest time at which
both Pi and Mj are available. With these rules in mind we
can follow step by step the action of the simulator whose
flowchart appears in figure 6.


1. The simulator is initialized. The values
m, n, tc, and tw are specified; tmj and tpi
are set to zero for all i and j. Y is set to the
total number of unit instruction executions to
be simulated.


2. The value i is selected so that tpi is a
minimum. This corresponds to the selection of
the processor for the current simulation cycle.

3. The value of j is selected. Similarly, this
corresponds to the selection of the memory for
the current simulation cycle. A multiplicative
congruential random number generator (Kruskal, 1969)
is used to uniformly generate integers in the
range 1,...,m.

[Figure 6. Simulator (flowchart not reproduced; legible box labels: "Y instructions executed?", "Compute predicted value of UER")]

4. The value of tp is selected. Normally it is
a constant, but in one simulation it is selected
so that for any given processor it oscillates
between two values whose average is tp.

5. A start time ts (for the unit instruction
execution of the current simulation cycle) is
computed as the maximum of tpi and tmj. The
time tmj is then set to the sum of the start
time and the memory cycle time tc. The time
tpi is set to the sum of the start time, the
memory access time ta and the processor
execution time tp.

6. If fewer than Y unit instructions have been
executed, another simulation cycle commences at step 2.
Otherwise the computation of the results begins
at step 7.

7. The exact time at which the simulation ends
is not precise; there are m + n times in the
simulator. Probably the most reasonable estimate
of the end time is:

$$T = \frac{1}{m + n}\left(\sum_i t_{pi} + \sum_j t_{mj}\right).$$

8. The simulation unit instruction execution
rate UERs is the ratio of the total number of
executions to the time required to execute
them; hence:

$$\mathrm{UER}_s = Y/T.$$

9. The UER is computed from either equation
4.11, 4.19, or 4.22 depending on the relation
of tp to tw. If equation 4.22 is appropriate
then the value of pm must be computed. A
Newton-Raphson search technique (Pierre, 1969) is
employed; it converges rapidly to the value of pm
in the interval (0,1).
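
The search for pm mentioned in step 9 can be sketched as follows. This Python fragment is an illustration of a Newton-Raphson iteration applied to equation 4.28, not a transcription of the thesis program; the function names, the starting point pm = 0, the tolerance and the iteration limit are all assumptions. Starting from pm = 0 is convenient because the left hand side of equation 4.28 is increasing and concave in pm, so the iterates approach the root from below and remain inside the interval.

```python
def solve_pm(m: int, n: int, tc: float, tp: float, tw: float,
             tol: float = 1e-12, max_iter: int = 50) -> float:
    """Newton-Raphson solution of equation 4.28 for pm (case tp > tw)."""
    if tp <= tw:
        raise ValueError("equation 4.28 applies only when tp > tw")
    ratio = (tp - tw) / tc
    c = (m / n) * ratio
    pm = 0.0                                   # assumed starting point
    for _ in range(max_iter):
        f = pm + c * (1.0 - (1.0 - pm / m) ** n) - 1.0      # equation 4.28
        df = 1.0 + ratio * (1.0 - pm / m) ** (n - 1)         # derivative in pm
        step = f / df
        pm -= step
        if abs(step) < tol:
            break
    return pm


def uer_slow_processor(m: int, n: int, tc: float, tp: float, tw: float) -> float:
    """UER from equation 4.22 using the solved value of pm."""
    pm = solve_pm(m, n, tc, tp, tw)
    return (m / tc) * (1.0 - (1.0 - pm / m) ** n)


if __name__ == "__main__":
    # The case of figure 10: tp = 2tc = 4tw, with times in units of tc.
    print(solve_pm(4, 4, 1.0, 2.0, 0.5), uer_slow_processor(4, 4, 1.0, 2.0, 0.5))
```

A useful consistency check on any solution is equation 4.26: n(1 - pm) should equal the UER multiplied by tp - tw.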

As  can  be  seen  from  the  above,  the  amount  of  simulator  activ¬ 
ity  per  unit  instruction  execution  is  independent  of  m.  The 
value  of  n  only  determines  the  number  of  values  of  i  which 
must  be  searched  to  find  the  minimum  in  step  2. 
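
Steps 1 through 8 can be sketched compactly in Python. This is a hedged illustration only: the thesis simulator is written in Algol and uses a multiplicative congruential generator, whereas this sketch substitutes Python's standard `random` module, holds tp constant, and takes ta as an explicit argument. All function and variable names are assumptions.

```python
import random


def simulate(m: int, n: int, tc: float, ta: float, tp: float,
             total: int, seed: int = 1) -> float:
    """Return the simulated rate UERs = Y/T for Y = total unit instructions."""
    rng = random.Random(seed)
    t_proc = [0.0] * n   # tpi: earliest time processor i can issue a request
    t_mem = [0.0] * m    # tmj: earliest time memory j can start a service
    for _ in range(total):
        # Step 2: processor with minimum tpi; ties go to the largest index.
        i = max(range(n), key=lambda idx: (-t_proc[idx], idx))
        # Step 3: memory chosen uniformly at random.
        j = rng.randrange(m)
        # Step 5: start when both the processor and the memory are available.
        start = max(t_proc[i], t_mem[j])
        t_mem[j] = start + tc
        t_proc[i] = start + ta + tp
    # Step 7: end time estimated as the mean of the m + n times.
    end_time = (sum(t_proc) + sum(t_mem)) / (m + n)
    # Step 8: simulated unit instruction execution rate.
    return total / end_time


if __name__ == "__main__":
    # The case of figure 7: tp = tw = 0.5tc, with times in units of tc.
    print(simulate(4, 4, 1.0, 0.5, 0.5, total=20000))
```

For one processor and one memory with tp = tw = 0.5tc the sketch returns a rate of exactly one per tc, which is the normalization used for the results below.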

The actual simulator is more involved than
described here and has provision for several instruction
formats (including the simple unit instruction used here). It
accepts input in the form of an instruction set where the
format, execution times and relative frequency of each
instruction may be specified. The sequence of events in
the simulator when executing unit instructions is precisely
that described above. The simulator is written in
Algol^1 for the Univac 1108 and it simulates the execution
of about 2000 unit instructions per second.

C.  Results 

A set of simulation results is presented in
figures 7 through 11. The results are presented in a
normalized form: the performance of a one processor, one memory
system with tp = tw is taken to be one. We use figures 7
through 10 to verify the basic multiprocessor analysis; figure
11 is used to verify the instruction reduction. The figures
are discussed individually and represent various cases of
interest:

Figure 7: For this case tp = tw = 0.5tc. The
difference between the predicted and the simulated
values is small. The maximum deviation is about
[value illegible in source], with the predicted value
higher than the simulated, and occurs when the ratio
of processors to memories is one or greater. This is
in accordance with the observation of chapter IV,
section C.

Figure 8: Here tp = 0.1tc and tw = 0.5tc. The
maximum deviation observed is about 10%; the
predicted performance is again higher than the
simulated performance. The worst deviations
occur when 3 or 4 very fast processors (tp = 0.1tc)
are used with 2 to 4 memories. (This is a
situation which would not likely occur in practice
because it probably would be uneconomical to
configure a system in this fashion.) This is
again in accordance with the observation of
chapter IV, section C. The slope of the
four-processor performance curve is still high even
when m = 16, suggesting that the performance of
the system can be significantly improved by
adding memories. (This may or may not be economical
though.) Although figure 8 does not show it,
the curves for the simulated and predicted results
converge for n = 3 and n = 4 when m > 64.

^1 A text of the simulator appears in the appendix.

Figure 9: This is the same as figure 8 except
that tp = 0.2tc. Basically the same comments
apply.

Figure 10: Here tp is greater than tw: tp = 2tc = 4tw.
The results show excellent agreement of the
simulated and the predicted results. For m > 1 the
corresponding curves are nearly indistinguishable.

Figure 11: These simulation results are presented
to verify the reduction of the instruction set to a
unit instruction with a single value of tp. The
curves present two different simulations. For
one the value of tp oscillates between 0.1tc and
0.9tc for a given processor with a mean of 0.5tc;
for the other a constant value of tp = 0.5tc
is used. The results suggest that the reduction
is probably a reasonable one. The relative
performance is slightly lower for the case where
tp varies than for the case where tp is fixed;
this is generally in accordance with what we
would expect for a stochastic service system.

D.  A  Comparison  with  Other  Simulation  Results 

By using some published multiprocessor simulation
results it is possible to provide a form of independent
verification of the analytic results of chapter IV.
Rosenfeld (1969) discusses the results of simulation of the
solution (by Gauss-Seidel iteration) of a set of simultaneous
linear algebraic equations on a multiprocessor computer.

The processors simulated have the general
characteristics and instruction set of the IBM 360 computer series.
Although the relative frequencies and processor execution
times for the instruction set are not given, a set of total
instruction execution times (which are presumably nominal
times for a single processor computer) are given. From
these times it appears that the instruction execution time
is roughly equal to the memory cycle time multiplied by the
number of memory cycles needed to execute the instruction.
Thus one can reasonably estimate that the average processor
activity time is about equal to the memory restore time, and
hence Rosenfeld's system can be described in our
terminology as a multiprocessor with tp = tw.

As we discussed in chapter IV the UER is directly
determined by the extent of memory cycle utilization.
Fortunately, one of the measurements Rosenfeld makes on his
simulated system is the memory cycle utilization and this
makes a direct comparison with his results quite simple.

For the multiprocessor with tp = tw the memory
cycle utilization is specified by the function g(m,n)
defined by equation 4.12:

$$g(m,n) = m\left(1 - (1 - 1/m)^n\right).$$

Figure 12 shows Rosenfeld's observed memory utilization
(solid lines) plotted together with g(m,n) (broken lines).
The agreement between the simulated and the predicted results
is rather good, with the utilization in the simulated system
generally somewhat higher. At least one reason may be advanced
to account for this: an incorrect assumed value of tp.
If tp were assumed somewhat less than tw, the analytically
predicted value of memory utilization would increase and the
curves for the simulated and the predicted results would
become nearly identical. Regardless of the value of tp
assumed, the general shape of the curves reflecting the
simulated and predicted results is the same.

Figure 12. A Comparison with Rosenfeld

Chapter VI An Analysis of I/O Effects on Processor
Performance

A.  I/O  Activity 

In chapter II we discussed the technological
reasons necessitating the presence of primary and secondary
memories in computer systems. The information stored in
a secondary memory is moved into the primary memory only
when it is actually ready to be used by the processor, and
after processing it is returned to the secondary memory.

The information flow between the memories is generally
called input/output (i/o) activity and this activity has a
degrading effect on the UER. Each word of information
transferred between the primary and the secondary memory
usually uses one cycle of the primary memory.^1 If both the
i/o and the processors are active simultaneously, conflicts
arise when both direct a request to the same memory
simultaneously. Normally if a conflict occurs the i/o request
is served first and the processor request is deferred until
the subsequent memory cycle. In other words, an i/o service
request has a higher priority than a processor request.
Priority is granted to the i/o because of the rotating
character of commonly used secondary memories (drums
and discs). For each i/o request that is not serviced
sufficiently rapidly, the i/o transfer process must be
delayed by the time required for one full rotation of the
memory device, thus delaying the processor waiting for the
i/o by the same amount of time. For some types of i/o it is
possible, however, to implement a dynamic priority scheme
where sometimes a processor request has higher priority
than an i/o request, and it is shown in section C that this
approach leads to a smaller degradation in processor
performance than the simple priority scheme.

^1 In some computers additional cycles are used to count
the number of i/o transfers and to specify the memory
locations to which the transfers go.

Several authors have given analyses of the effects
of i/o activity.^2 Flores (1964) determines the extent of
queueing of i/o requests on memories. (His analysis does
not consider the processors.) Flores' model is developed
from the following ideas. The i/o requests for memory
service are assumed to be generated by a Poisson process
with a mean request rate of RIO. The requests are considered
to be uniformly distributed among the m memories and hence
each memory has a mean request rate of RIO/m. The memory
is considered to be a server (in a queueing sense) with a
constant service time tc. The result is a simple queueing
situation with Poisson input and constant service time.
The mean time elapsing between the initiation of a memory
request and the time the service of that request begins is
computed. Flores does not propose, however, a purpose to
which that time, once computed, can be put. Shemer and
Gupta (1969) extend Flores' model to consider the effect
of i/o activity on the performance of a single processor.
In their model a processor with an average processing
time tp generates random requests to the m memories.
Simultaneously, i/o requests generated by a Poisson process with
mean rate RIO compete with the processor for the available
memory cycles. Their rather involved analysis allows for
i/o queueing and they compute the average time required
to complete a memory request initiated by the processor.

^2 The notation in the following discussion is not that of
the original authors.


In order to understand the relation of the above
authors' analyses to that of this chapter, it is necessary
to look at the nature of the Poisson process (Hillier and
Lieberman, 1967) used as a model of the source of i/o
requests. Each memory experiences a mean request rate of
RIO/m and hence during an interval of time tc the
probability distribution for the number q of i/o requests received
is:

$$p(q) = \frac{\bigl(\mathrm{RIO}/(m/t_c)\bigr)^q\, e^{-\mathrm{RIO}/(m/t_c)}}{q!}.$$


The average number of requests received during tc is
RIO/(m/tc), but the above equation associates a non-zero
probability with any finite number q of requests. (When
RIO/(m/tc) is small, though, the probabilities associated with
large values of q fall off very rapidly.) The type of i/o
activity which is likely to use a significant portion of
the primary memory cycles (and thus significantly affect
the UER) is that from very high speed discs and drums
(or low speed core used as a secondary memory) and each
of these is characterized by a regular periodic flow rate.
The number of such devices likely to be in simultaneous
operation in a computer system is small: often one,
perhaps as many as three or four. While the Poisson process
is a satisfactory model for representing the generation of
requests from a number of (unsynchronized) periodic sources,
probably an equally satisfactory model, when the number of
sources is small, is simply to assume that there is a
probability RIO/(m/tc) that one i/o request per memory is
received during an interval tc and a zero probability of more than
one request. This is especially suitable when each of the
i/o devices has a small amount of buffering (as it usually
does). This assumption is used in section B. When there is
only one i/o device with a periodic flow rate in operation,
an advantageous i/o handling scheme can be implemented.
This situation is assumed in section C.
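
To see how little probability the Poisson model assigns to more than one request per memory cycle, one can tabulate it for a small mean. The Python sketch below is illustrative only; the mean of 0.25 requests per cycle is an assumed sample value (it corresponds to RIO/(m/tc) = 1/4), and the function names are not from the thesis.

```python
from math import exp, factorial


def poisson_pmf(q: int, mean: float) -> float:
    """Poisson probability of q requests when the mean number per cycle is `mean`."""
    return mean ** q * exp(-mean) / factorial(q)


def prob_more_than_one(mean: float) -> float:
    """Probability that two or more requests arrive during one memory cycle."""
    return 1.0 - poisson_pmf(0, mean) - poisson_pmf(1, mean)


if __name__ == "__main__":
    mean = 0.25   # assumed sample value of RIO/(m/tc)
    for q in range(4):
        print(q, poisson_pmf(q, mean))
    print("two or more:", prob_more_than_one(mean))
```

For a mean this small, the mass on two or more requests is only a few percent, which is the part the single-request assumption sets to zero.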

B.  Simple  I/O  Handling 

In at least one way the i/o activity looks like
the (n + 1)st processor in the multiprocessor system and it
would be attractive to be able to handle it as such.
However the multiprocessor analysis is derived on the basis
of identical processors and because of the priority granted
i/o requests, the i/o activity looks rather different from
a processor. The i/o activity does, like the processors,
contribute to the occupancy of the memory system and hence
increases the rate at which memory cycles are utilized.
Our general approach in the subsequent analysis is to
compute occupancy of the memories with i/o, determine the
rate of memory cycle utilization, and then apportion that
rate between the processors and the i/o. (We shall
present an analysis only for the multiprocessor case where
tp = tw; extensions to cover the other cases are not
difficult.)


Let us consider a particular memory, say Mj. Let
A be the event that Mj is occupied by a processor request
and let B be the event that Mj is occupied by an i/o
request. From the previous discussion the probability of B
is:

$$p(B) = \frac{\mathrm{RIO}}{m/t_c} \qquad (6.1)$$

where necessarily RIO must be such that p(B) < 1. We will
assume that the probability of A is not affected
significantly by the i/o activity. This is equivalent to the
assumption that A and B are independent events. Thus when
tp = tw, p(A) can be computed from equation 4.11:

$$p(A) = 1 - (1 - 1/m)^n. \qquad (6.2)$$

The probability of a memory being occupied by either a
processor or an i/o request is the probability of the event
A or B. When A and B are independent the probability of the
event A or B is:

$$p(A \text{ or } B) = p(A) + p(B) - p(A)\,p(B). \qquad (6.3)$$

Substituting equations 6.1 and 6.2 in 6.3 we have:

$$p(A \text{ or } B) = 1 - (1 - 1/m)^n + (1 - 1/m)^n\,\frac{\mathrm{RIO}}{m/t_c}. \qquad (6.4)$$

Now, having determined the probability of occupancy of one
memory, we can determine the rate R at which memory
requests are serviced by multiplying by m/tc:

$$R = \frac{m}{t_c}\,p(A \text{ or } B) = \frac{m}{t_c}\left(1 - (1 - 1/m)^n + (1 - 1/m)^n\,\frac{\mathrm{RIO}}{m/t_c}\right). \qquad (6.5)$$

The rate R includes the service of both i/o and processor
memory requests. But since the i/o requests are served
first, the rate R includes exactly a rate RIO of serviced
i/o requests. Hence, the UER can be determined by
subtracting RIO from R:

$$\mathrm{UER} = R - \mathrm{RIO} = \frac{m}{t_c}\left(1 - \frac{\mathrm{RIO}}{m/t_c}\right)\left(1 - (1 - 1/m)^n\right). \qquad (6.6)$$

We note that this is just the UER that would be observed
without i/o multiplied by the factor (1 - RIO/(m/tc)).

C.  Dynamic  Priority  I/O  Handling 

We will assume for this analysis that the i/o
requests originate from a single periodic source, and we
will see that it is possible to specify a method of handling
i/o requests which results in less degradation of the UER
than that specified by equation 6.6 above. In order to
simplify the discussion and the following derivation, let
us assume that the ratio (m/tc)/RIO is an integer whose
value is N > 1. This means that exactly 1/N of the total
memory cycles of any given memory, say Mj, are used by the
i/o. Furthermore, let us assume that the i/o requests are
generally for sequential memory locations and hence Mj
receives an i/o request exactly once for every N cycles. Let
us assume that there is associated with each memory a one
word buffer to hold the i/o information and a control
mechanism to implement the following strategy:

1. If fewer than N - 1 cycles have elapsed since
the i/o request was received, the processor
requests have priority; an i/o request is serviced
only if there is no processor waiting for service.

2. If N - 1 cycles have elapsed and the i/o
request has not yet been serviced, the i/o
request gets the current memory cycle.

We  term  this  dynamic  priority  i/o  handling. 

Let us consider a sequence of N cycles for Mj.
The probability that a given cycle is occupied by a processor
request is p(A) specified by equation 6.2. The probability
that a given cycle is not occupied by a processor request
is 1 - p(A). In the absence of i/o requests the probability
that k of the cycles are used by processor requests is
specified by a binomial distribution:

$$p(k) = \binom{N}{k}\,p(A)^k\,(1 - p(A))^{N-k}; \quad k = 0,\ldots,N. \qquad (6.7)$$

Let C be a random variable equal to the number of cycles
used to service both i/o and processor requests during the
N cycle sequence. The expected value of C is:

$$E(C) = 1 + \sum_{k=0}^{N-1} k\binom{N}{k}\,p(A)^k\,(1 - p(A))^{N-k} + (N - 1)\,p(N). \qquad (6.8)$$

In the above expression, the first term accounts for the
cycle received by the i/o, and the second term accounts
for the cycles received by up to N - 1 processor requests.
The last term arises because even if the processors request
all N cycles of the N cycle sequence, they only get N - 1; the
i/o gets the remaining one. Equation 6.8 may be rewritten:

$$E(C) = 1 + \sum_{k=0}^{N} k\binom{N}{k}\,p(A)^k\,(1 - p(A))^{N-k} - p(N). \qquad (6.9)$$

The  summation  represents  the  expected  value  of  the  number 
of  processor  requests  during  the  N  cycle  sequence;  hence 
it  is  just  Np(A).  Thus: 

E(C)  =  1  +  Np(A)  -  p(N) .  (6.10) 

The  average  occupancy  of  a  memory  cycle  over  the  N  cycle 
sequence  is  E(C)/N;  we  can  then  compute  R  for  this  case: 

R  =  (m/tc)(E(C)/N) 

=  (m/tc)(1/N  +  p(A)  -  p(N)/N).  (6.11) 

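Since we assumed N = (m/tc)/RIO, and since
$p(N) = p(A)^N$ by equation 6.7, we have:

$$R = \frac{m}{t_c}\left(\frac{\mathrm{RIO}}{m/t_c} + 1 - (1 - 1/m)^n - \frac{\mathrm{RIO}}{m/t_c}\left(1 - (1 - 1/m)^n\right)^N\right). \qquad (6.12)$$

As before the UER is determined by subtracting RIO from
R:

$$\mathrm{UER} = R - \mathrm{RIO} = \frac{m}{t_c}\left(1 - \frac{\mathrm{RIO}}{m/t_c}\left(1 - (1 - 1/m)^n\right)^{N-1}\right)\left(1 - (1 - 1/m)^n\right). \qquad (6.13)$$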

The latter is just the UER without i/o multiplied by the
factor $1 - (\mathrm{RIO}/(m/t_c))\,(1 - (1 - 1/m)^n)^{N-1}$. Since
$(1 - (1 - 1/m)^n)^{N-1} < 1$ this method of i/o handling results
in a lower degradation of the processor performance. If
(m/tc)/RIO is not an integer, then substituting for N in
equation 6.13 the largest integer not greater than (m/tc)/RIO
gives a satisfactory approximation for the UER.
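
The expected cycle count used in this derivation can be checked by direct enumeration. In the Python sketch below, the processors receive min(k, N - 1) of the N cycles when they request k of them, and the i/o always receives one; the result is compared with the closed form 1 + Np(A) - p(A)^N of equation 6.10. The function names and sample values are illustrative assumptions.

```python
from math import comb


def expected_cycles_enumerated(big_n: int, p_a: float) -> float:
    """E(C) by enumeration: one i/o cycle plus min(k, N - 1) processor cycles."""
    total = 1.0
    for k in range(big_n + 1):
        p_k = comb(big_n, k) * p_a ** k * (1.0 - p_a) ** (big_n - k)   # equation 6.7
        total += min(k, big_n - 1) * p_k
    return total


def expected_cycles_closed(big_n: int, p_a: float) -> float:
    """E(C) = 1 + N p(A) - p(A)^N, equation 6.10 with p(N) = p(A)^N."""
    return 1.0 + big_n * p_a - p_a ** big_n


if __name__ == "__main__":
    p_a = 1.0 - (1.0 - 1.0 / 4) ** 4   # p(A) for m = n = 4, equation 6.2
    print(expected_cycles_enumerated(4, p_a), expected_cycles_closed(4, p_a))
```

The two expressions agree for any N of at least one and any p(A) between zero and one.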

D.  Example 

Consider  a  4  processor,  4  memory  system  with 
tp  a  t&  «  tw  =  0.5  usee,  and  RIO  =  10^  requests/sec.  The 
UER  without  i/o  is  computed  from  equation  4.11: 

UER( without  i/o)  =  (4/1.0  x  10“^  sec.)  x 

(1  -  (1  -  1/4)4) 

=  2.73  x  106/sec. 

When  the  i/o  is  considered  without  the  dynamic  priority 
scheme  we  use  equation  6.6  to  find: 

UER( simple  i/o)  =  (1  -  1/4) (2.73  x  106/sec.) 

=  2.02  x  106/sec. 

If  we  implement  the  dynamic  priority  i/o  handling  we  find 
N  =  (m/tc)/RI0  =  4  and  hence  using  equation  6.13  we  have:  * 


•f* 
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UER(dynamic  i/o)  =  (1  -  (1/4)0  -  0  -  1/4)^)^)  X 

(2.73  x  10^/sec.) 

=  2.52  x  106/sec. 

Tho  UER  with  the  dynamic  priority  scheme  is  about  23%  higher 
than  that  obtained  without  it— a  substantial  improvement. 


Let  us  define  a  memory  efficiency  effl  as: 
e  -  total  memory  cycles  used/sec. 
m  total  memory  cycles  available/sec , 


_  UER  +  RIO _ 

m/tc 

and  a  processor  efficiency  e^  as: 

e  _  total  instructions  executed/sec. 

p  total  number  of  instructions  executed 
if  there  were  no  memory  delays/sec. 


(6.14) 


(6.15) 


We can now compute the efficiencies for the preceding example:

1. No i/o:

   $e_m = \frac{2.73 \times 10^6}{4.00 \times 10^6} = 0.67$

   $e_p = \frac{2.73 \times 10^6}{8.00 \times 10^6} = 0.34$

2. Simple i/o:

   $e_m = \frac{3.02 \times 10^6}{4.00 \times 10^6} = 0.75$

   $e_p = \frac{2.02 \times 10^6}{8.00 \times 10^6} = 0.25$

3. Dynamic i/o:

   $e_m = \frac{3.52 \times 10^6}{4.00 \times 10^6} = 0.88$

   $e_p = \frac{2.52 \times 10^6}{8.00 \times 10^6} = 0.32$
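
The two efficiencies are simple ratios and can be sketched as follows. The function names are hypothetical. The memory efficiency follows equation 6.14 directly; for the processor efficiency the denominator is assumed to be n/t_p, which is what the 8.00 x 10^6 figure in the worked numbers corresponds to for n = 4 and t_p = 0.5 µsec. Rates are in units of 10^6 per second and times in µsec.

```python
def memory_efficiency(uer, rio, m, t_c):
    """Equation 6.14: memory cycles used over memory cycles available."""
    return (uer + rio) / (m / t_c)


def processor_efficiency(uer, n, t_p):
    """Instructions executed over the rate with no memory delays.

    The no-delay rate is assumed here to be n / t_p.
    """
    return uer / (n / t_p)


if __name__ == "__main__":
    # Dynamic i/o case of the example: UER = 2.52, RIO = 1.0 (10^6/sec).
    print(memory_efficiency(2.52, 1.0, 4, 1.0))
    print(processor_efficiency(2.52, 4, 0.5))
```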


Chapter  VII  Computer  Design 


A.  Optimization  Approach 

The  purpose  of  this  chapter  is  to  indicate  how 
the  multiprocessing  models  of  chapter  IV  can  be  used  in  a 
simple  automatic  design  program.  An  appropriate  context 
in which to consider the design process is that of an optimization
problem. The nature of the optimization problem
is  to  relate  the  costs  of  a  proposed  design  to  the  variables 
which  reflect  its  structure  and  then  choose  the  values  of 
the  variables  so  that  required  performance  is  obtained  and 
the  cost  of  the  design  is  minimized.  We  have  taken  the  UER 
as the basic measure of computer performance and the formulas
of chapter IV relate the UER to the variables tw, ta,
td, tp, m, and n. If we can also relate the costs of the
design  to  these  variables  we  have  the  necessary  relationships 
to  formulate  the  optimization  problem  and  hence  to  implement 
an  automatic  design  program. 

B.  Costs  and  the  Problem  Formulation 

The  three  types  of  components  whose  costs  we 
consider  to  enter  into  the  overall  cost  of  the  multiprocessor 
system  are  memories,  processors,  and  switches.  While  it 
is interesting to consider the possibilities of relating
by formula the costs of the components to their specifications,
the relations would be both rather difficult to
obtain and probably (because of the discrete nature of the
manufacturing process) not very meaningful. Hence we shall
assume  that  the  costs  of  the  components  are  related  to  the 
variables  in  a  tabulated  form. 

There  are  several  considerations  that  enter  into 
the  determination  of  the  individual  component  costs: 

1. Switches: The switch has to connect n processors to m memories. Hence there are m x n potential connections implied in the structures we are considering and if the cost to realize a simple switch is $C_s$, the total cost for the multiprocessor switch is about n x m x $C_s$.

2. Memories: As indicated in chapter II, the cost per word of a coincident current magnetic memory is lower in memories of a larger number of words than in one of a smaller number of words. We assume that the total memory system has been specified in advance to have w words. If the cost of a memory of w words and cycle time $t_c$ is $C_m(w, t_c)$, then the total memory cost with an m-way interleaved memory is m x $C_m(w/m, t_c)$, assuming that a memory is available in that size and speed. The value of $t_w$ is determined once $t_c$ is specified and the former is not considered a design variable.

3. Processors: The cost of the processor is dependent on the many different speeds associated with its internal operations. Once an instruction mix has been specified a single value $t_p$ is determined by equation 4.2. The cost of the processors is then n x $C_p(t_p)$.

The  above  relations  allow  the  cost  C  of  the 
multiprocessor  computer  to  be  expressed  as: 

$$C = m \times C_m(w/m,\ t_c(t_w)) + n \times C_p(t_p) + m \times n \times C_s \qquad (7.1)$$

The  performance  of  the  multiprocessor  computer  is  specified 
by equation (4.11), (4.19) or (4.22) depending on the relationship
of tp to tw. We symbolically include all three
equations  in  the  following: 

$$UER = UER(t_p,\ t_w,\ t_c,\ m,\ n). \qquad (7.2)$$

We  now  state  the  optimization  problem  as: 
minimize C

such that $UER \geq UER_{required}$

where C and UER are specified by equations 7.1 and 7.2.

To  this  other  constraints  may  be  added;  for  example,  one 
limiting  the  number  of  processors  or  stating  that  the  num¬ 
ber  of  processors  must  be  greater  than  two. 

The approach we have taken to solve the optimization problem is an exhaustive search over the possible values (as tabulated) of $t_p$ and $t_c$ (implying $t_w$) and over a specified set of values for m and n. A search space of no more than $10^4$ points exists if we assume about 5 to 10 values for each of the variables.
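
The exhaustive search just described can be sketched in a few lines. This is a minimal illustration, not the Algol program used for the thesis: the function name, the dictionary layout of the cost tables, and the `uer_fn` callback are all assumptions made for the sketch. The cost expression follows equation 7.1, and the caller supplies whichever UER formula of chapter IV applies.

```python
from itertools import product


def exhaustive_search(memory_cost, processor_cost, switch_cost,
                      total_words, m_values, n_values, uer_fn, uer_required):
    """Return the cheapest feasible (cost, m, n, t_c, t_p, uer), or None.

    memory_cost    -- {(module_words, t_c): cost of one memory module}
    processor_cost -- {t_p: cost of one processor}
    uer_fn         -- callable (t_p, t_c, m, n) -> UER, supplied by the caller
    """
    cycle_times = sorted({t_c for (_, t_c) in memory_cost})
    best = None
    for t_p, t_c, m, n in product(sorted(processor_cost), cycle_times,
                                  m_values, n_values):
        if total_words % m:
            continue  # the total memory must split evenly into m modules
        module = (total_words // m, t_c)
        if module not in memory_cost:
            continue  # no memory is tabulated in that size and speed
        # Equation 7.1: memories + processors + switch connections.
        cost = (m * memory_cost[module] + n * processor_cost[t_p]
                + m * n * switch_cost)
        uer = uer_fn(t_p, t_c, m, n)
        if uer >= uer_required and (best is None or cost < best[0]):
            best = (cost, m, n, t_c, t_p, uer)
    return best
```

Because every point of the search space is visited, the same loop could just as easily collect all feasible structures instead of keeping only the cheapest one.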


An Algol program was written for the Univac 1108 to evaluate equations 7.1 and 7.2 over a specified set of values of $t_p$, $t_c$, m, and n and pick the optimum. Despite the fact that the exhaustive search approach lacks sophistication (although it is difficult to think of other techniques that could be used) it has the definite advantage that all potential structures are evaluated. Furthermore, the search is carried out sufficiently rapidly (about 0.2 sec./100 structures) that there is little incentive to consider other methods. Thus, in addition to choosing the optimum structure, the costs and performances of the sub-optimal structures are also available and it is interesting to group them according to their performance and the constraints violated. It is always important to the designer
to  know  what  the  sensitivity  of  a  proposed  design  is  to  the 
design  constraints  and  objectives;  that  is,  how  the  design 
would  change  if  the  constraints  and  objectives  were  altered 
somewhat.  This  is  readily  determined  if  an  evaluation  of 
all  potential  structures  in  the  design  space  is  available. 

Note that the above formulation of the optimization problem is not the only one possible. A design goal might be to design a system that has the maximum UER possible but does not exceed a cost $C_{max}$. Another design goal might be to design a system which has the minimum cost/performance ratio (the cost of executing an instruction per unit time is a minimum). The reformulation of the optimization problem to handle these cases is perfectly straightforward. In the subsequent sections we present two examples: one minimizes the system cost for a given UER; the other minimizes the cost/performance ratio.

C.  Example  1 :  Minimization  of  System  Cost 

For  this  example  we  have  the  components  avail¬ 
able  as  listed  in  table  1 .  The  design  objectives  and 
constraints  are: 

1. The UER must equal or exceed $10^6$ instructions per second.

2.  The  total  memory  size  must  be  64  K  words. 

3.  The  number  of  processors  must  not  exceed 

four. 

4.  The  total  system  cost  must  be  minimized. 

The  component  costs  are  also  indicated  in  table  1.  They 
were  chosen  rather  arbitrarily  but  they  are  probably  not 
unrepresentative  for  memories  in  the  18  to  24  bit  per  word 
size  and  the  related  processors. 

There  are  over  100  configurations  which  meet 
constraints two and three; 49 meet constraint one, and of
these, four are presented in table 2. The optimum design
is  indicated  by  an  asterisk;  it  is  a  three  processor  system. 
The other designs presented are the best (in terms of cost/
performance ratio) using one, two, and four processors.
Overall, the best cost/performance ratio is found in the
four  processor  system.  The  best  single  processor  system 
Memory:

    Size    1.0 µsec.   2.0 µsec.   4.0 µsec.
    4K      $4000       $3000       $2000
    8K      $7000       $5000       $3000
    16K     $10000      $7000       $4000

Processor:

    0.5 µsec.   $50000
    1.0 µsec.   $20000
    2.0 µsec.   $10000

Switch:

    $500/connection

Table 1. Example 1 Costs


    Design   m   n   t_c   t_p   UER    Cost    Cost/UER
    1        4   3   1.0   2.0   1.15   76000   6.61 *
    2        4   2   1.0   1.0   1.25   84000   6.72
    3        4   4   1.0   2.0   1.49   88000   5.91
    4        4   1   1.0   0.5   1.00   92000   9.20

Units:

    t_c, t_p -- µsec.
    UER      -- 10^6 instructions/sec.
    Cost     -- $
    Cost/UER -- 10^-2 $/instruction/sec.

Table 2. Example 1 Designs
costs appreciably more than the best multiprocessor system
and  has  an  appreciably  higher  cost/performance  ratio. 
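
As a check on the cost column of table 2, equation 7.1 can be evaluated directly from the component costs of table 1. The sketch below is illustrative only: the names are hypothetical, memory sizes are keyed in K words, and the memory term is taken as m modules of size 64K/m, as described under "Memories" in section B.

```python
# Component costs transcribed from table 1 (dollars).
MEMORY_COST = {  # (module size in K words, cycle time in usec) -> cost
    (4, 1.0): 4000, (4, 2.0): 3000, (4, 4.0): 2000,
    (8, 1.0): 7000, (8, 2.0): 5000, (8, 4.0): 3000,
    (16, 1.0): 10000, (16, 2.0): 7000, (16, 4.0): 4000,
}
PROCESSOR_COST = {0.5: 50000, 1.0: 20000, 2.0: 10000}  # t_p in usec -> cost
SWITCH_COST = 500  # per processor-memory connection


def system_cost(m, n, t_c, t_p, total_k=64):
    """Equation 7.1 for an m-way interleaved memory of total_k K words."""
    module_k = total_k // m
    return (m * MEMORY_COST[(module_k, t_c)]
            + n * PROCESSOR_COST[t_p]
            + m * n * SWITCH_COST)


if __name__ == "__main__":
    # (m, n, t_c, t_p) for designs 1 to 4 of table 2.
    for design in [(4, 3, 1.0, 2.0), (4, 2, 1.0, 1.0),
                   (4, 4, 1.0, 2.0), (4, 1, 1.0, 0.5)]:
        print(design, system_cost(*design))
```

The four costs computed this way agree with the Cost column of table 2.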

D.  Example  2:  Minimization  of  Cost/Performance  Ratio 

For  this  example  we  use  cost  data  from  a  real 
computer system: the Digital Equipment Corporation PDP-10.
This  is  a  36  bit  word,  single  address  instruction  format 
computer  which  has  facilities  that  enable  it  to  be  used 
in  a  multiprocessor  configuration.  The  components  available 
to  build  PDP-10  systems  and  their  related  costs  are  given 
in table 3. In order to put the example in realistic
terms,  we  will  deal  with  the  actual  IER  of  a  typical  but 
simplified  instruction  mix.  The  mix  chosen  has  a  scien¬ 
tific  computational  bias:  20%  floating  point  multiply, 

30%  fixed  point  add,  20%  branch,  and  30%  load/store.  The 
branch  instruction  is  not  in  the  single  address  format; 
it  has  no  operand  reference.  The  PDP-10  System  Manual 
(Digital  Equipment  Corporation,  1968)  gives  a  rather  elab¬ 
orate  breakdown  of  the  processor  execution  times  (as  dis¬ 
tinguished  from  instruction  execution  times)  for  the  various 
instructions and the value of $t_p$ for the mix above is computed
to be 1.08 µsec. Taking into account the branch
instruction, there are an average of 1.8 memory references
per instruction; hence the actual IER = UER/1.8.
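
The figure of 1.8 memory references per instruction can be reproduced from the mix. The sketch below uses hypothetical names and assumes that every single address format instruction makes two memory references (the instruction and its operand) while the branch makes only one, as the text indicates.

```python
# Instruction mix: name -> (relative frequency, memory references).
MIX = {
    "floating point multiply": (0.20, 2),
    "fixed point add": (0.30, 2),
    "branch": (0.20, 1),  # no operand reference
    "load/store": (0.30, 2),
}


def refs_per_instruction(mix):
    """Average number of memory references per instruction."""
    return sum(freq * refs for freq, refs in mix.values())


def ier_from_uer(uer, mix):
    """Actual instruction execution rate for a given unit execution rate."""
    return uer / refs_per_instruction(mix)


if __name__ == "__main__":
    print(refs_per_instruction(MIX))
```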

The  design  objectives  and  constraints  are  as 


follows: 
Memory:

    Size    1.0 µsec.   1.8 µsec.
    8K      $40000
    16K     $51000
    32K                 $70000
    64K                 $112000
    128K                $196000

Processor:

    1.08 µsec.   $151000

Switch:

    $1500/connection

Table 3. PDP-10 Costs

(Costs obtained from "PDP-10 Pricing Summary", Digital
Equipment Corporation, Maynard, Mass., March 30, 1969)


1. The main memory size must be 256K words.

2. The IER must equal or exceed $0.5 \times 10^6$
instructions per second.

3. The number of processors must not exceed
four.

4.  The  cost/performance  ratio  must  be  minimized. 

The eleven system configurations which meet constraints one, two, and three are given in table 4. The system which meets objective four is indicated by an asterisk. As in example 1, the best performance/cost ratio is obtained with four processors. Objective two is such that it cannot be met with fewer than two processors; the best two processor organization has a cost/performance ratio about 20% higher than the four processor organization.




    Design   m    n   t_c   IER    Cost   Cost/IER
    1        16   2   1.0   0.67   1.12   1.67
    2        4    2   1.8   0.55   0.75   1.38
    3        8    2   1.8   0.58   0.86   1.48
    4        16   3   1.0   0.98   1.27   1.29
    5        2    3   1.8   0.54   0.85   1.56
    6        4    3   1.8   0.72   0.90   1.26
    7        8    3   1.8   0.82   1.02   1.23
    8        16   4   1.0   1.29   1.42   1.11 *
    9        2    4   1.8   0.58   1.00   1.72
    10       4    4   1.8   0.85   1.06   1.24
    11       8    4   1.8   1.03   1.17   1.13

Units:

    t_c      -- µsec.
    IER      -- 10^6 instructions/sec.
    Cost     -- 10^6 $
    Cost/IER -- $/instruction/sec.

Table 4. PDP-10 Designs
Chapter VIII  Conclusion


A. Summary

In  the  preceding  chapters  we  have  presented  a 
series  of  analytic  models  which  quantitatively  relate  the 
performance  of  a  certain  class  of  computer  structures  to 
the  basic  component  variables.  We  have  corroborated  the 
analysis  with  simulation  studies  and  used  it  in  a  simple  auto¬ 
matic  design  program.  As  stated  in  chapter  I,  a  major  goal 
of  this  thesis  is  to  derive  analytic  models  whose  use  would 
facilitate the design of digital computers. The question now
arises: to what extent is the analysis actually useful in
the computer design process and what extensions, if any,
might be made to make it more useful? The answer is suggested
by a review of the design program of chapter VII.

In  example  1  (of  chapter  VII)  there  are  over  100  potential 
computer  structures  which  meet  the  processor  and  memory  con¬ 
straints.  In  less  than  0.2  seconds  of  Univac  1108  computer 
time  these  structures  were  determined,  their  performance  and 
costs  evaluated,  and  the  optimum  structure  picked.  The 
evaluations  of  the  sub-optimal  structures  allow  us  to  in¬ 
teract  with  the  design  process  in  the  sense  that  we  can  see 
how  the  optimal  structure  changes  if  the  design  objectives 
or constraints are changed somewhat.

For  the  computer  designer  this  type  of  activity 
is  an  economical  tool  to  generate  an  initial  set  of  design 
alternatives. At the present state of development, the
design  program  clearly  does  not  design  computers,  but  it 
does  make  some  preliminary  steps  toward  that  objective. 

The  generation  of  initial  design  structures  and  the  in¬ 
sight  gained  from  the  models  into  how  those  structures 
behave  certainly  is  valuable  to  the  computer  designer  and 
hence  this  thesis  has  probably  succeeded  in  its  objectives. 
There does remain, however, a most interesting prospect
for further automation of the computer design process and
we feel that, with some extensions of the models and the
design program, this prospect can be realized.

B. Extensions of the Models

In  chapter  II  it  was  assumed  that  all  instructions 
and  operands  occupied  one  memory  word  and  that  instructions 
were of the single address format type. In most real computers
such a situation does not exist. Normally there are
at  least  three  different  operand  formats:  fixed  point 
numbers,  floating  point  numbers,  and  symbols  (character 
strings);  and  there  are  multiple  instruction  formats.  In 
general,  each  of  the  operand  and  instruction  formats  is  of 
a  different  size.  Since  the  primary  memory  word  size  is 
fixed,  efficient  utilization  of  the  memory  requires  that 
one  or  both  of  the  following  techniques  be  employed:  (1) 
pack  more  than  one  instruction  or  operand  in  a  memory  word 
or  (2)  use  more  than  one  memory  word  to  hold  an  instruction 
or operand. Technique (2) usually slows down the IER relative
to the case where there is one instruction (or
operand) per memory word since more than two memory references
are required to execute an instruction. Technique
(1) may actually result in an increased IER. If two instructions
are packed in a memory word, the processor can
obtain  the  current  and  succeeding  instructions  with  just 
one  memory  reference. 

As  an  illustration  of  the  foregoing,  consider  a 
computer  which  has  the  following  formats: 

1.  24  bit  instructions 

2.  24  bit  fixed  point  numbers 

3.  48  bit  floating  point  numbers 

4.  8  bit  characters. 

The  main  memory  word  size  for  this  computer  might  be  48,  24, 
or  even  eight  bits.  The  choice  depends  on  the  relative 
use  of  the  various  formats  and  the  level  of  performance 
desired.  A  high  performance  computer  with  heavy  floating 
point  usage  would  obviously  have  a  48  bit  word  size  while  a 
lower  performance  computer  primarily  manipulating  characters 
would  probably  have  an  eight  or  a  24  bit  memory  word  size. 

The  foregoing  type  of  considerations  represent 
a  major  activity  in  the  computer  design  process.  The  design 
program  of  chapter  VII  could  be  extended  to  handle  consider¬ 
ations  of  this  type  if  the  appropriate  input  information 
were  available.  This  information  would  include  the  relative 
usage of the various formats and their sizes. Various
primary memory sizes could be tried by the design program,
and for each size a relationship r between the IER and UER
would be defined (such that r x IER = UER), as well as a value
of  tp.  These  definitions  would  essentially  be  a  formali¬ 
zation  of  the  ideas  that  were  discussed  in  section  fi  of 
chapter  IV.  The  value  of  tp  would  be  defined  as  the  aver¬ 
age  amount  of  processor  time  per  memory  reference  and  it  is 
a  function  not  only  of  the  processor  speed  but  also  of  the 
memory  organization  and  the  instruction  and  operand  formats. 
The  value  of  r  would  be  defined  as  the  average  number  of 
memory references made to execute an instruction and it is
also  a  function  of  the  way  operands  and  data  are  mapped  into 
the  memory.  With  both  r  and  tp  defined,  the  design  program 
would  be  essentially  as  that  in  chapter  VII. 

C.  A  Proposal  for  Continued  Work 

It is proposed that a design program be implemented
which would specify the high level structure for a
computer so that a desired IER is realized and that the cost
of the design is a minimum. (Other design objectives and
constraints  involving  costs  and  performance  could  of  course 
be  used.)  Such  a  program  would  undoubtedly  be  interactive 
so  that  the  designer  could  see  the  effect  of  varying  the 
input  design  objectives  and  constraints.  The  inputs  to  the 
design  program  would  be  the  following: 


1. The component costs;

2. The desired IER and other constraints such as total memory size, limits on the number of processors, and so forth; (The constraints might be specified in complex ways. The total memory size might be made a function of the number of processors with the memory size increasing as the number of processors increases.)

3. The instruction and operand formats and their associated relative frequencies;

4. The i/o activity, possibly as a function of the total memory size and the number of processors. (Indeed, if the i/o activity were known as a function of memory size, the memory size might be made a design variable rather than a constraint.)

The  design  program  would  then  pick  the  number  of  memories  and 
processors,  the  memory  word  size  (and  possibly  the  total 
memory  size),  and  the  memory  and  processor  speeds.  Since 
this design program essentially includes the instruction
set as one of its inputs, it might be possible to interconnect
it with a design program which specifies instruction
sets (Haney, 1968), thus potentially extending design automation
to cover several levels of the computer design process.
Appendix

The  simulator  handles  three  classes  of  instruc¬ 
tions: 

1 .  Class  1 :  Single  address  format 

2.  Class  2:  Instruction  without  operand  reference 
(like  a  unit  instruction) 

3.  Class  3:  Write  instructions  (no  processor 
execution  time  and  an  operand  memory  reference 
time  of  tw). 

An instruction execution consists of two phases: (1) the
instruction  reference  and  decode  and  (2)  the  operand  re¬ 
ference  and  execute.  Class  two  instructions  have  no  phase 
two;  for  them  the  execute  is  accomplished  in  phase  one.  In 
the  simulator  each  phase  is  handled  as  the  execution  of  a 
unit  instruction. 
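
The two-phase scheme described above can be sketched in Python as follows. This is an illustrative reading of the simulator, not a transcription of it: the function and variable names are hypothetical, Python's `random` module stands in for the simulator's own random number generator, and the update rules (a memory is busy for a full cycle on a read and for t_w on a write; a processor resumes once the memory access completes and the decode or execute time has elapsed) are assumptions drawn from the description and from the legible parts of the listing.

```python
import random


def simulate(m, n, t_c, t_w, t_d, instructions, steps=2000, seed=1):
    """Estimate the unit instruction execution rate by simulation.

    instructions -- list of (class, processor execution time, relative frequency)
    Returns unit instructions executed per unit of simulated time.
    """
    rng = random.Random(seed)
    total = sum(freq for _, _, freq in instructions)
    cumulative, running = [], 0.0
    for _, _, freq in instructions:
        running += freq / total
        cumulative.append(running)

    tm = [0.0] * m   # time at which each memory becomes free
    tp = [0.0] * n   # time at which each processor becomes free
    phase = [1] * n  # 1 = instruction reference, 2 = operand reference
    cls = [1] * n    # class of the instruction held by each processor
    tt = [0.0] * n   # execution time of that instruction

    for _ in range(steps):
        j = rng.randrange(m)                    # memory referenced
        i = min(range(n), key=lambda k: tp[k])  # earliest free processor
        if phase[i] == 1:
            q = rng.random()
            pick = next((k for k, c in enumerate(cumulative) if q <= c),
                        len(cumulative) - 1)
            cls[i], tt[i] = instructions[pick][0], instructions[pick][1]
            tm[j] = max(tp[i], tm[j]) + t_c
            tp[i] = tm[j] - t_w + t_d
            if cls[i] == 2:
                tp[i] += tt[i]  # class 2: execute in phase one, no operand
            else:
                phase[i] = 2
        else:
            phase[i] = 1
            if cls[i] == 1:
                tm[j] = max(tm[j], tp[i]) + t_c
                tp[i] = tm[j] - t_w + tt[i]
            else:
                tm[j] = max(tm[j], tp[i]) + t_w  # class 3: write
    elapsed = (sum(tm) + sum(tp)) / (m + n)
    return steps / elapsed
```

For a single processor and a single memory with only class 1 instructions the run is deterministic, which makes a convenient sanity check: with t_c = 1.0 and t_w = t_d = 0.5 and an execution time of 0.5, each unit instruction takes exactly one memory cycle.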

BEGIN 

FORMAT  FII'  M* . XS* ,X5. * TC* »XA, »TW» ,XA, »TP* i 
X7»»R»,XA»*WP*»XJ  •*«/•??  »»A|.ZH 
FORMAT  F2  (  2  (  !  3  .  XT  »  ,  A  (IV,  .2  ,XT  »  .  A  1  .  I  )  * 

INTEGER NI,K,U,I,J,L,R,S,Z,Y,M,N,CC$

REAL D,A,Q,TW,TA,TC,TD,TTP,V,X,H,T,RO,RP$

READ(NI)$

Z«a  I  92*s 

Y*20HRS 

BEGIN
  COMMENT IC IS THE INSTRUCTION CLASS, FI IS THE
    INSTRUCTION RELATIVE FREQUENCY, AND IT IS THE
    PROCESSOR EXECUTION TIME$
  INTEGER ARRAY IC(1..NI)$
  REAL ARRAY CP,FI,IT(1..NI)$
  COMMENT READ IN THE INSTRUCTION SET VARIABLES$
  FOR K=(1,1,NI) DO
  BEGIN
    READ(IC(K),IT(K),FI(K))$
    A=A+FI(K)$
  END$
  COMMENT NORMALIZE RELATIVE FREQUENCIES AND
    COMPUTE CUMULATIVE PROBABILITIES$
  CP(1)=FI(1)=FI(1)/A$
  FOR K=(2,1,NI) DO
  BEGIN
    FI(K)=FI(K)/A$
    CP(K)=CP(K-1)+FI(K)$
  END$
  COMMENT U IS THE NUMBER OF SIMULATIONS TO BE
    RUN$
  READ(U)$
  COMMENT READ MEMORY RESTORE, MEMORY CYCLE AND
    INSTRUCTION DECODE TIMES$
  READ(TW,TC,TD)$
  TA=TC-TW$
  WRITE(F1)$
  FOR K=(1,1,U) DO
  BEGIN
    READ(M,N)$
    BEGIN
      INTEGER ARRAY P,C(1..N)$
      REAL ARRAY TT,TP(1..N),TM(1..M)$
      FOR L=(1,1,N) DO P(L)=1$
      R=S=I=1$
      COMMENT CC COUNTS THE NUMBER OF UNIT
        INSTRUCTIONS EXECUTED$
      FOR CC=(1,1,Y) DO
      BEGIN
        COMMENT CHOOSE A MEMORY -- DETERMINE J$
        FOR L=1,2,3 DO R=MOD(5*R,Z)$
        J=(R*M)//Z+1$

        COMMENT CHOOSE A PROCESSOR -- DETERMINE I$
        FOR L=(1,1,N) DO IF TP(L)
          LSS TP(I) THEN I=L$
        COMMENT IF PHASE ONE CHOOSE A NEW INSTRUCTION.
          GENERATE A RANDOM NUMBER IN (0,1) AND CHOOSE
          INSTRUCTION WHOSE CUMULATIVE PROBABILITY FIRST
          EXCEEDS THAT NUMBER$
        IF P(I) EQL 1 THEN
        BEGIN
          FOR L=1,2,3 DO S=MOD(5*S,Z)$
          Q=S/Z$
          L=1$
          FOR L=L WHILE Q GTR CP(L)
            DO L=L+1$
          C(I)=IC(L)$
          COMMENT TT=PROCESSOR EXECUTION TIME FOR THE
            SELECTED INSTRUCTION. UPDATE MEMORY AND
            PROCESSOR TIMES (INSTRUCTION REFERENCE PHASE)$
          TT(I)=IT(L)$
          TM(J)=MAX(TP(I),TM(J))+TC$
          TP(I)=TM(J)-TW+TD$
          COMMENT IF INSTRUCTION IS OF CLASS 2 NO
            ADDITIONAL MEMORY REFERENCES NEED BE MADE.
            UPDATE PROCESSOR TIME BY EXECUTE TIME. PHASE
            REMAINS ONE SO THAT A NEW INSTRUCTION IS CHOSEN
            FOR PROCESSOR I. IF INSTRUCTION IS NOT OF
            CLASS 2, THE PHASE IS SET TO 2$
          IF C(I) EQL 2 THEN TP(I)=
            TP(I)+TT(I) ELSE P(I)=2$
        END ELSE
        BEGIN
          P(I)=1$
          COMMENT UPDATE MEMORY AND PROCESSOR TIMES
            (INSTRUCTION EXECUTION PHASE)$
          IF C(I) EQL 1 THEN
          BEGIN
            TM(J)=MAX(TM(J),TP(I))
              +TC$
            TP(I)=TM(J)-TW+TT(I)$
          END ELSE
            TM(J)=MAX(TM(J),TP(I))+TW$
        END$
      END$
      A=0$
      FOR L=(1,1,M) DO A=A+TM(L)$
      FOR L=(1,1,N) DO A=A+TP(L)$
      COMMENT COMPUTE RO THE OBSERVED RATE OF
        INSTRUCTION EXECUTION$
      RO=Y*(M+N)/A$
      A=0$
      FOR L=(1,1,NI) DO A=A+IT(L)$
      TTP=(TD+A/NI)/2$
      COMMENT COMPUTE THE AVERAGE PROCESSOR ACTIVITY
        TIME. FORM THE DIFFERENCE OF TTP AND TW$
      COMMENT IF D IS LESS THAN OR EQUAL TO ZERO
        COMPUTE RP THE PREDICTED RATE OF INSTRUCTION
        EXECUTION$
      D=TTP-TW$

      IF D LEQ 0 THEN
      BEGIN
        V=(1-1/M)**N$
        RP=M*(1-V)/(TC+V*D)$
      END ELSE
      BEGIN
        COMMENT SUBSTITUTE IN EQUATION FOR THE
          EXECUTION RATE WHERE TP IS GREATER THAN TW THE
          RELATION X=(1-RP/M)**N, THEN USE NEWTON-
          RAPHSON SEARCH TO FIND X. THE STARTING VALUE
          OF X IS ONE. THE SEARCH STOPS WHEN SUCCESSIVE
          ITERATIONS ON X DIFFER BY LESS THAN .001$
        X=1$
        T=D/(M*TC)$
        FOR H=(T*X**N+X-1.+1/M-T)/
          (N*T*X**(N-1)+1)
          WHILE H GTR 0.001 DO X=X-H$
        RP=M*(1-X**N)/TC$
      END$
      WRITE(F2,M,N,TC,TW,TTP,RO,RP,
        RO/RP)$
    END$
  END
END$
END$